

Imagen: Breaking LLM Intuition (By: Haleema Tallat) - DevFest Lahore 2025

Talk by Haleema Tallat (https://www.linkedin.com/in/haleema-tallat/) at DevFest Lahore 2025 by GDG Lahore.


December 20, 2025

Transcript

  1. LLMs trained us to believe:
     ◦ Retries fix errors and improve outputs
     ◦ Prompts specify precise outcomes
     ◦ Evaluation is cheap and straightforward
     ◦ Determinism is achievable with effort
  2. Imagen generates images by reversing accumulated noise through iterative refinement:
     ◦ Begin with x_T, completely random pixels
     ◦ Gradually refine towards x_0 through learned steps
     ◦ All pixels update: every denoising step modifies the entire image
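The refinement loop on this slide can be sketched as a toy simulation. This is not Imagen's actual network: the learned denoiser is stubbed as a small pull toward a fixed target image, purely so the loop structure (start from noise, update every pixel a little at each step) is visible.

```python
import numpy as np

rng = np.random.default_rng(0)

T = 50                                  # number of denoising steps
target = np.full((8, 8), 0.5)           # stand-in for what a trained model "knows"
x = rng.standard_normal((8, 8))         # x_T: completely random pixels

for t in range(T, 0, -1):
    # every step updates the ENTIRE image; no pixel is "finished" early
    x = x + 0.1 * (target - x)          # stubbed denoising update

# after many small steps, x has drifted close to the target x_0
print(np.abs(x - target).max())
```

The point of the sketch is structural: there is no step where an object is "completed", only repeated whole-image refinement.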
  3. • No privileged step: each iteration has equal importance
     • No "finishing" concept: objects don't get completed sequentially
     • No symbolic structure: Imagen never knows it's drawing a hand
  4. Why Imagen Is Inherently High-Variance
     Imagen samples from a wide, multimodal distribution where language dramatically under-constrains visual outcomes.
  5. In LLMs:
     • Temperature controls randomness
     • Retries with lower temperature converge
     • Errors can be corrected
     • Quality improves with iterations
  6. In Imagen:
     • Retries don't correct errors
     • They resample the distribution
     • Each attempt is independent
     • No convergence guarantee
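The contrast between these two slides can be sketched with assumed stand-in distributions (not real model outputs): an LLM retry at lower temperature concentrates samples around the mode, while a diffusion-style retry is just an independent redraw from the same wide distribution.

```python
import random

random.seed(42)

def llm_retry(temperature):
    # lower temperature -> samples cluster tightly around the mode
    return random.gauss(0.0, temperature)

def imagen_retry():
    # each attempt is an independent draw from the SAME distribution;
    # retrying does not narrow it
    return random.gauss(0.0, 1.0)

llm_samples = [llm_retry(temperature=0.1) for _ in range(1000)]
imagen_samples = [imagen_retry() for _ in range(1000)]

def spread(xs):
    return max(xs) - min(xs)

# low-temperature retries converge; independent resamples stay wide
print(spread(llm_samples) < spread(imagen_samples))
```

The names `llm_retry` and `imagen_retry` are illustrative only; the sketch shows why "try again" narrows outcomes in one regime and not the other.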
  7. • Spatial constraints saturate early
       ◦ Adding "on the left" or "in the centre" provides diminishing control
     • Detail shifts style, not structure
       ◦ More descriptive prompts change texture and mood, not fundamental composition
     • Improvements hit limits fast
       ◦ Prompt engineering reaches a ceiling far earlier than with LLMs
  8. Matter at Hand
     • Imagen 4 generates more convincing hands with better anatomical appearance
     • Finger count remains probabilistic, not enforced
     • No hand-level structural invariants exist in the model
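"Probabilistic, not enforced" can be made concrete with a toy sampler. The probabilities below are hypothetical, not measured from Imagen; the point is that a sampler with no structural invariant can make five fingers very likely without ever forbidding four or six.

```python
import random

random.seed(7)

# hypothetical distribution over finger counts (illustrative numbers)
finger_dist = {4: 0.05, 5: 0.90, 6: 0.05}

counts = random.choices(
    list(finger_dist),                   # possible finger counts
    weights=list(finger_dist.values()),  # likely, but nothing is enforced
    k=200,                               # 200 generated "hands"
)

# most hands have 5 fingers, yet the sampler never rules out 4 or 6
print(sorted(set(counts)))
```

A model with an enforced invariant would instead reject or repair any sample with the wrong count; nothing in this sampling loop does that.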
  9. Mirrors, Text, Symmetry
     • Imagen does not model:
       ◦ object identity across space
       ◦ bidirectional consistency
       ◦ symbols as symbols
     • It models local pixel correlations
     • Realism collapses under global constraints
  10. Realism
      • learns how images usually look
      • captures texture, lighting, style
      • makes unlikely images less likely
      • produces convincing surface coherence
      • “Does this look like something I’ve seen before?”
      Understanding
      • no enforced rules or invariants
      • no object identity across space
      • no constraint satisfaction
      • “Is this necessarily correct?”
  11. What Improves Every Generation
      Imagen:
      • texture fidelity
      • lighting & materials
      • style imitation
      • photorealism -> surface plausibility
      LLMs:
      • instruction following
      • compositional reasoning
      • long-range coherence
      • constraint satisfaction -> behavioral reliability
  12. Imagen in a Startup Product: Viggle
      • uses Imagen 3 to generate virtual characters from text
      • users describe characters
      • Imagen creates visually plausible characters for video
      • motion, timing, and storytelling are handled by the product
  13. Imagen in a Startup Product: Cartwheel
      • uses Imagen 3 for text-to-character visual generation
      • creators describe characters in natural language
      • Imagen generates character visuals directly in the product
      • characters are then animated and exported fully rigged
      • Imagen supports creative ideation, not animation logic
  14. LLMs trained us to expect generative models to behave like software. But Imagen shows us:
      • outputs are sampled, not computed
      • realism can emerge without understanding
      • improvement does not imply new capabilities