Streaming Lakehouse to Agentic AI - Kafka, Flink & Iceberg meets MCP

Slides from the Singapore Apache Kafka meetup talk on September 18th, 2025. Presents the concept of using MCP (Model Context Protocol) in a data lakehouse setup to enable natural-language, conversational data analysis & exploration using AI tools and LLMs.

Zabeer Farook

September 18, 2025

Transcript

  1. HELLO!! I'm Zabeer Farook, Technical Architect, Credit Agricole CIB.
     - Passionate about data architecture, including stream data processing & event-driven architecture, as well as Cloud & DevOps.
     - Love travelling & exploring places.
  2. AGENDA
     01 Context & Background
     02 Streaming Lakehouse Recap
     03 Introduction to MCP (Model Context Protocol)
     04 MCP'ing the Lakehouse
     05 Industry Use Cases & Security Best Practices
     06 Demo
     07 Closing & QA
  3. Context & Background
     Modern data platforms are increasingly being built on the Lakehouse architecture with open table formats:
     • Combines the best of data warehouses & data lakes
     • Performance
     • Provides transactional guarantees
     • Cost-efficient & scalable by leveraging object storage like S3
     • Uses open table formats like Iceberg
     • Interoperability with different engines like Spark, Flink, Trino (see the sketch after this slide)
     • Open architecture without vendor lock-in
     MCP (Model Context Protocol) brings agentic AI capabilities to the Lakehouse and enables conversational data exploration & analysis.
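To make the interoperability point concrete, here is a minimal sketch of reading an Iceberg table from Python with PyIceberg against a REST catalog. The catalog URI, object-store endpoint and the sales.orders table are assumptions for illustration, not part of the talk.

```python
# Minimal sketch: reading an Iceberg table via PyIceberg and a REST catalog.
# The same table is equally queryable from Spark, Flink or Trino.
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "lakehouse",
    **{
        "type": "rest",
        "uri": "http://localhost:8181",          # REST catalog endpoint (assumed)
        "s3.endpoint": "http://localhost:9000",  # MinIO / S3-compatible store (assumed)
    },
)

table = catalog.load_table("sales.orders")       # hypothetical namespace.table
# Scan the current snapshot into an Arrow table; every engine sees the
# same ACID-consistent view of the data.
print(table.scan(limit=10).to_arrow())
```

Because the table metadata lives in the open Iceberg format, different engines can read and write the same table without copies, which is what avoids the vendor lock-in mentioned above.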
  4. Streaming Lakehouse Recap
     [Diagram: Ingest → Store & Process → Serve (API layer, REST catalog) → Consume. Stream & batch sources of structured, semi-structured & unstructured data land as raw then cleansed data in the data lake; a metadata layer provides data governance, indexing and data management; consumers include AI/ML & data science, batch & real-time analytics, and reports & BI.]
     • Real-time ingestion layer to ingest data from real-time sources like Kafka.
     • Storage layer with object storage like MinIO, AWS S3 or GCS; the data itself is stored in formats like Parquet, Avro or ORC.
     • Metadata layer made of open table formats like Iceberg, with features like ACID compliance & time travel.
     • Raw data can be further processed in batch or streaming mode with engines like Flink (see the sketch after this list).
     • Serving layer with query engines like Trino, providing query and API capabilities to the consumption layer for AI/ML, reporting, analytics & visualization.
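As a sketch of the "process the stream with Flink" step, the following PyFlink SQL job tails a Kafka topic and continuously appends into an Iceberg table. The topic, catalog and table names and the connector options are illustrative assumptions; the matching Kafka and Iceberg connector jars must be on the Flink classpath.

```python
# Sketch: streaming ingestion from Kafka into an Iceberg table with PyFlink SQL.
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Kafka source: raw order events arriving as JSON (topic name assumed).
t_env.execute_sql("""
    CREATE TABLE orders_src (
        order_id STRING,
        product  STRING,
        quantity INT,
        ts       TIMESTAMP(3)
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'orders',
        'properties.bootstrap.servers' = 'localhost:9092',
        'scan.startup.mode' = 'earliest-offset',
        'format' = 'json'
    )
""")

# Iceberg catalog over REST; the target table is assumed to already exist.
t_env.execute_sql("""
    CREATE CATALOG lakehouse WITH (
        'type' = 'iceberg',
        'catalog-type' = 'rest',
        'uri' = 'http://localhost:8181'
    )
""")

# Continuously append the stream into the Iceberg table.
t_env.execute_sql(
    "INSERT INTO lakehouse.sales.orders SELECT * FROM orders_src"
).wait()
```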
  5. Introduction to MCP - Evolution of LLMs
     Simple LLMs (pure text generation with a limited context window) →
     RAG systems (real-time knowledge retrieval from external sources) →
     Tool-calling agents (function calling & external tool integration) →
     Agentic RAG (intelligent retrieval decisions & query planning) →
     Multi-agent systems (agent collaboration & distributed problem solving)
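A minimal sketch of the tool-calling stage in this evolution: the model emits a structured function call, the host executes it, and the result is fed back. The get_stock_level tool and the shape of the payload are hypothetical, modeled on typical function-calling output from LLM APIs.

```python
import json

# Hypothetical local tool registry the agent host can dispatch to.
TOOLS = {
    "get_stock_level": lambda product: {"product": product, "stock": 3},
}

def run_tool_call(llm_response: dict) -> str:
    """Dispatch a function call emitted by the model to a local tool."""
    name = llm_response["name"]
    args = json.loads(llm_response["arguments"])
    return json.dumps(TOOLS[name](**args))

# A response shaped like typical function-calling output from an LLM API:
fake_llm_response = {"name": "get_stock_level",
                     "arguments": '{"product": "espresso beans"}'}
print(run_tool_call(fake_llm_response))  # result would be sent back to the LLM
```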
  6. Introduction to MCP - Why & What?
     MCP (Model Context Protocol) is an open, standardized interface that enables LLMs to:
     • Interact seamlessly and securely with external systems, APIs and data sources
     • Gain agentic AI capabilities that turn them into intelligent operators who can take action based on natural-language inputs
     Communication protocol -> JSON-RPC
     Transport protocol -> stdio / SSE / Streamable HTTP
     Without a standard communication & data exchange protocol between AI systems & external tools, integrating different tools with multiple AI systems poses complex integration challenges and friction. Imagine not having a USB-C port on your computer, or not having a universal travel adapter while travelling.
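A minimal MCP server sketch using the FastMCP helper from the official Python SDK (pip install "mcp[cli]"). The server name and the echo tool are illustrative; the stdio transport matches the transport list above.

```python
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("lakehouse-demo")  # illustrative server name

@mcp.tool()
def echo(message: str) -> str:
    """Echo a message back to the calling LLM."""
    return message

if __name__ == "__main__":
    # JSON-RPC over stdio; SSE / Streamable HTTP are alternative transports.
    mcp.run(transport="stdio")
```

Swapping in transport="sse" or transport="streamable-http" exposes the same tools over HTTP instead of stdio.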
  7. MCP'ing the Lakehouse
     • Exposing the lakehouse tools (Kafka, Flink, Iceberg etc.) through MCP
     • Natural-language-based operations and exploration in the lakehouse
     Sample natural-language data exploration prompts:
     • What are the different Flink jobs running in my Dev cluster?
     • What are the top 10 products ordered today by customers?
     • Which products are low in stock?
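A sketch of what "MCP'ing the lakehouse" can look like in practice: an MCP tool that runs a read-only SQL query against the Iceberg tables served by Trino, so an AI client can answer prompts like the ones above by generating the SQL itself. The host, port, catalog/schema and tool name are assumptions for illustration (pip install "mcp[cli]" trino).

```python
from mcp.server.fastmcp import FastMCP
import trino

mcp = FastMCP("lakehouse-mcp")  # hypothetical server name

@mcp.tool()
def query_lakehouse(sql: str) -> list[list]:
    """Run a SELECT against the Iceberg tables served by Trino."""
    if not sql.lstrip().lower().startswith("select"):
        raise ValueError("only SELECT statements are allowed")
    conn = trino.dbapi.connect(
        host="localhost", port=8080, user="mcp",
        catalog="iceberg", schema="sales",   # assumed catalog/schema
    )
    cur = conn.cursor()
    cur.execute(sql)
    return cur.fetchall()

if __name__ == "__main__":
    mcp.run(transport="stdio")
```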
  8. Industry Use Cases
     Potential lakehouse / data-platform-specific use cases with MCP and agentic AI are:
     • Data platform democratization
       ◦ Self-service analytics without SQL knowledge for business users
       ◦ Automated data lineage
     • Intelligent lakehouse operations management (see the sketch after this list)
       ◦ Automated compaction, snapshot expiration, orphan-file cleanup
       ◦ Data quality monitoring and automated remediation
     • Automated monitoring
       ◦ Real-time pipeline monitoring (Kafka lag, throughput)
       ◦ Ops assistant for Kafka/Flink clusters
     • Performance & cost optimization
       ◦ Query performance analysis & recommendations
       ◦ Resource utilization & right-sizing
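As a sketch of the operations-management use case, the routine below expires old Iceberg snapshots through Trino's expire_snapshots table procedure. The table name, retention window and connection details are assumptions; wired up as an MCP tool, an agent could trigger it from a natural-language request.

```python
import trino

def expire_snapshots(table: str, retention: str = "7d") -> None:
    """Drop snapshots older than the retention window to reclaim storage."""
    conn = trino.dbapi.connect(host="localhost", port=8080, user="ops",
                               catalog="iceberg", schema="sales")  # assumed
    cur = conn.cursor()
    # Trino Iceberg connector's table-maintenance procedure syntax.
    cur.execute(
        f"ALTER TABLE {table} EXECUTE expire_snapshots"
        f"(retention_threshold => '{retention}')"
    )
    cur.fetchall()  # consume the procedure's result set

expire_snapshots("orders")  # hypothetical table
```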
  9. Security & Guardrails
     • Strong auth & authz - not every user should be able to perform admin operations
     • Guardrails are critical - scope the tools carefully
     • Security - sanitize and validate inputs to prevent injection attacks (see the sketch after this list)
     • Privacy - expose only necessary data fields & mask sensitive data
     • Cost control - avoid over-querying the LLM or large tables
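A small sketch of the input-validation guardrail, assuming the MCP tool accepts raw SQL: allow a single read-only statement and reject anything resembling an admin operation or stacked-query injection. The blocked-keyword list is illustrative, not exhaustive.

```python
import re

# Keywords that indicate write or admin operations (illustrative list).
BLOCKED = re.compile(
    r"\b(insert|update|delete|drop|alter|create|grant|call|merge)\b", re.I)

def validate_sql(sql: str) -> str:
    """Return the statement if it is a single SELECT, else raise."""
    stmt = sql.strip().rstrip(";")
    if ";" in stmt:
        raise ValueError("multiple statements are not allowed")
    if not stmt.lower().startswith("select"):
        raise ValueError("only SELECT statements are allowed")
    if BLOCKED.search(stmt):
        raise ValueError("statement contains a blocked keyword")
    return stmt

validate_sql("SELECT product, stock FROM inventory WHERE stock < 10")
```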
  10. Closing Thoughts
      "The future of data platforms is open, streaming and agentic - built on lakehouses that think and act with us."
      MCP is powerful but still evolving; adopt and adjust as it evolves.
  11. Q&A