Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Giving Eyes to Your AI: Building Vision-Enabled...

Giving Eyes to Your AI: Building Vision-Enabled Agents with Haystack

This tutorial shows developers how to build multimodal, vision-enabled agents with Haystack, combining LLM reasoning with visual understanding. In this session, you’ll learn to:

- Extend agents with vision-language models to process images and PDFs, then build an end-to-end agent that answers from text + visuals.
- Deploy your multimodal agent with tools like Open WebUI and Hayhooks for real-world use.

Colab notebook: https://github.com/bilgeyucel/presentations/blob/main/agentic_ai_conference_2025/Multimodal_Agent.ipynb
Multimodal Agent demo: https://github.com/deepset-ai/haystack-demos/tree/main/multimodal_agent

Avatar for Bilge Yücel

Bilge Yücel

September 17, 2025
Tweet

More Decks by Bilge Yücel

Other Decks in Programming

Transcript

  1. Giving Eyes to Your AI Building Vision-Enabled Agents with Haystack

    Bilge Yücel, Developer Relations Engineer @ deepset
  2. Bilge Yücel Hello 👋 • Developer Relations Engineer at deepset

    • B.Sc. Computer Science • M.Sc. Artificial Intelligence • Learning & teaching how to build with AI in/bilge-yucel @bilgeyucl
  3. Who is deepset? Company Solving Custom AI challenges since 2018.

    HQ in Berlin and NYC. Backed by: Leading open source framework & commercial platforms for custom enterprise-grade AI Products Used by 70 Thought leaders
  4. Agenda 01 Core Concepts: Agents & Workflows & VLMs 02

    Agents in Haystack 03 Giving Eyes to Our AI Colab & Demo 04 Q & A
  5. What is an AI agent? An AI agent is an

    LLM-based system that autonomously pursues a goal by interacting with its environment using tools. Human LLM Call Environment Action Feedback Stop
  6. Agents vs Workflows • Autonomous system • You have complex,

    multi-step problems requiring diverse actions • Tasks involve multiple tools / sources • Goals are clear but the optimal path to achieve them isn't predetermined
  7. Agents vs Workflows • Autonomous system • You have complex,

    multi-step problems requiring diverse actions • Tasks involve multiple tools / sources • Goals are clear but the optimal path to achieve them isn't predetermined • Defined flow • When interactions can follow predictable patterns (e.g. Q&A • Tasks can be decomposed into clear steps • Stability, robustness and efficiency are prioritized over automation potential
  8. What is a Vision Language Model? Text → Text: Large

    Language Models LLMs Text + Image → Text: Vision Language Models VLMs They are a type of generative models that take image and text inputs, and generate text outputs. Suitable for Visual Q&A. Hugging Face article: https://huggingface.co/blog/vlms • How to choose the right Vision Language Model • Technical Details • Fine-tuning Vision Language Models How is it different from an LLM?
  9. • Open-source AI orchestration framework by deepset • Backbone of

    the deepset AI Platform • Provides the tools that Python developers need to build real world, agentic AI systems with visibility, control and modularity • Building blocks: Components & Pipelines
  10. Haystack Agents User Request Agent LM (e.g. OpenAI, Anthropic, Google,

    Open Models) System Prompt Python Functions External APIs Haystack Components MCP Servers Generated Answer
  11. Multimodal Agent • Can process images + PDFs • Access

    to tool: get weather • 🏗 Orchestration: Haystack • 🧠 Model Provider: OpenAI • 💻 Platform: Google Colab
  12. Hayhooks • Tool to deploy and serve Haystack Pipelines and

    Agents as REST APIs and MCP Servers • Wrap Pipelines and Agents with custom logic and expose them via HTTP endpoints • OpenAI-compatible chat completion endpoints→ Open WebUI
  13. Multimodal Agent • Can process images + PDFs • Access

    to tool: get weather • 🏗 Orchestration: Haystack • 🧠 Model Provider: OpenAI • 🔌 REST API Hayhooks • 🖼 UI Open WebUI
  14. Next Steps • Provide more tools ◦ Connect to MCP

    Servers ◦ Use a multimodal RAG pipeline as a Tool ◦ Prebuilt Github tools • Try open models through Ollama, vLLM, Hugging Face • Think about observability, guardrails
  15. Building AI Agents with Haystack Build a Tool-Calling Agent AI

    Guardrails: Content Moderation and Safety with Open Language Models Creating a Multi-Agent System with Haystack Trace and Evaluate RAG with Arize Phoenix DevOps Support Agent with Human in the Loop Build a GitHub Issue Resolver Agent Build a GitHub PR Creator Agent Building AI Agents with Haystack