


Natural Language Analytics on the Open Lakehouse: Why is it harder than it looks? Meetup talk from the Singapore Apache Kafka meetup, April 10th 2026

A talk that explains natural language analytics on the Open Lakehouse and compares the naive approach of connecting agents directly to the data layer with connecting them through a semantic layer. A demo shows both the impressive results and the blind spots when an agent connects directly to the data layer, and how adding a semantic layer fixes those blind spots. The talk uses a context-assisted semantic layer approach.


Zabeer Farook

April 14, 2026


Transcript

  1. HELLO !! I'm Zabeer Farook, Principal Engineering Architect @ Credit Agricole CIB.
     - Passionate about data architecture, including stream data processing & event-driven architecture, as well as cloud & DevOps.
     - Love travelling & exploring places.
  2. Agenda
     01 Natural Language Analytics on the Lakehouse
     02 Open Lakehouse: The De Facto Standard
     03 How does it work in a Lakehouse?
     04 The impressive side of things (with demo)
     05 The blind spot & Why
     06 How to fix the blind spot? (with demo)
     07 Q&A
  3. Open Lakehouse - The De Facto Standard
     - Open table formats: uses formats like Iceberg, Delta Lake, Hudi
     - Performance & flexibility: combines the best of data warehouses (performance) and data lakes (flexibility)
     - Transactional guarantees: provides full ACID (Atomicity, Consistency, Isolation, Durability) transaction guarantees
     - Interoperability: works with different engines like Spark, Flink, Trino
     - Cost efficiency & scalability: leverages object storage like S3
     - No vendor lock-in: open architecture
     Modern data platforms are increasingly being built with the lakehouse architecture and open table formats like Iceberg.
  4. Open Lakehouse Architecture: Ingest, Store & Process, Serve, Consume
     - Ingestion layer: real-time ingestion from sources like Kafka plus batch ingestion, taking in structured, semi-structured & unstructured data
     - Storage layer: object storage like MinIO, AWS S3, GCS; the data itself is stored in formats like Parquet, Avro, or ORC
     - Metadata layer: open table formats like Iceberg and a catalog (e.g. a REST catalog) to manage metadata files, with data governance, indexing, and data management
     - Processing layer: raw data is further processed into cleansed data in batch or streaming mode with engines like Spark and Flink
     - Serving layer: query engines like Trino providing query and API capabilities to the consumption layer
     - Consumption layer: AI/ML & data science, batch & real-time analytics, reports & BI
  5. Natural Language Analytics on the Lakehouse
     "Can I ask questions on my data in natural language?"
     - "Show me revenue trends by region for last quarter"
     - "Why did sales drop in APAC last month?"
     - "Which customers are at risk of churning?"
     - "What's our gross margin by product line?"
     AI agents + MCP finally make this technically feasible: any LLM can discover your lakehouse tools, plan a query strategy, execute SQL, and synthesise results, in natural language, in seconds.
  6. How does it work in a Lakehouse? (The naive approach)
     User / Application ("Show me revenue trends by region for last quarter")
     → AI Agent (Claude / GPT / any LLM): plans, reasons, calls tools via MCP, synthesises results
     → MCP server on the lakehouse, exposing tools: getSchema(), listTables(), describeTable(), executeQuery()
     → Trino (or other query engines): SQL execution
     → Lakehouse with Apache Iceberg (or other table formats): storage
     MCP = an open API for AI agents: an open standard from Anthropic that works with any LLM.
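The naive tool surface above can be sketched in a few lines. This is a minimal illustration, not a real MCP server: the tool names follow the slide, while the in-memory catalog and the echoed query result are stand-ins for Trino and Iceberg.

```python
# Hypothetical in-memory "catalog" standing in for Trino/Iceberg metadata.
CATALOG = {
    "orders": ["orderkey", "custkey", "totalprice", "orderdate"],
    "customer": ["custkey", "name", "acctbal", "nationkey"],
}

def list_tables():
    return sorted(CATALOG)

def describe_table(table):
    return CATALOG[table]

def execute_query(sql):
    # A real server would forward the SQL to Trino; here we only echo it back.
    return {"engine": "trino", "sql": sql, "rows": []}

# The agent discovers tools by name and invokes them, JSON-RPC style.
TOOLS = {
    "listTables": list_tables,
    "describeTable": describe_table,
    "executeQuery": execute_query,
}

def call_tool(name, *args):
    return TOOLS[name](*args)
```

Note that every tool here exposes only schema and raw data, which is exactly why the blind spot discussed later appears: nothing in this surface carries business meaning.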
  7. MCP - The bridge for agents to access external tools
     MCP (Model Context Protocol) is an open, standardised interface that enables LLMs to:
     - Interact seamlessly and securely with external systems, APIs, and data sources
     - Gain agentic AI capabilities, making them intelligent operators that can take action based on natural-language input
     Communication protocol: JSON-RPC. Transport: stdio / SSE / streamable HTTP.
     Without a standard communication & data-exchange protocol between AI systems and external tools, integrating different tools with multiple AI systems poses complex integration challenges and friction. MCP is the "USB-C for AI agents".
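The JSON-RPC framing mentioned above looks roughly like this on the wire. The method names `tools/list` and `tools/call` come from the MCP specification; the tool name `executeQuery` and its SQL argument are illustrative.

```python
import json

# JSON-RPC 2.0 request asking the server which tools it exposes.
list_request = {"jsonrpc": "2.0", "id": 1, "method": "tools/list"}

# JSON-RPC 2.0 request invoking one tool with arguments.
call_request = {
    "jsonrpc": "2.0",
    "id": 2,
    "method": "tools/call",
    "params": {
        "name": "executeQuery",          # illustrative tool name
        "arguments": {"sql": "SELECT count(*) FROM orders"},
    },
}

# What actually travels over stdio / SSE / streamable HTTP is the JSON text.
wire = json.dumps(call_request)
```

The transport only moves these envelopes back and forth; the intelligence stays in the agent, which decides which tool to call next based on each response.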
  8. The impressive side of things: the agent connects directly to the Lakehouse
     Let's see it in action with an Iceberg lakehouse holding the TPC-H dataset (Customer, Orders, LineItem, Supplier, Part, PartSupplier, Region, Nation).
     - Q1, schema-level question: "What are the available schemas?"
     - Q2, table-level question: "List the available tables"
     - Q3, column-related question: "What columns does the customer table have?"
     - Q4, simple query on data: "How many orders were placed in Q1?"
     - Q5, aggregated query on data: "Show top 5 nations by total order value"
  9. The blind spot: same setup, but different questions this time
     "Show me Q1 revenue for premium customers"
     ✗ Ambiguity in formula: is revenue totalPrice, extendedPrice, or extendedPrice * (1 - discount)?
     ✗ Ambiguous business term: the agent guesses the definition of "premium customers"
     ✗ Potentially wrong join path: the agent might use the wrong join path
     ✗ Confident wrong answer: it might end up giving incorrect values while sounding confident
     Let's see it in action with questions involving some business terms.
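The formula ambiguity is easy to see with toy numbers. Both of the sums below are plausible interpretations of "revenue" over the same line items; the row values are made up for illustration, and only one formula matches the TPC-H convention of `extendedprice * (1 - discount)`.

```python
# Toy line items as (extendedprice, discount) pairs; the numbers are made up.
line_items = [(1000.0, 0.25), (500.0, 0.00), (2000.0, 0.50)]

# Guess 1: ignore the discount entirely.
naive_revenue = sum(ep for ep, _ in line_items)

# Guess 2: the TPC-H convention, extendedprice * (1 - discount).
tpch_revenue = sum(ep * (1 - d) for ep, d in line_items)

# 3500.0 vs 2250.0: both look plausible, and only one matches Finance's number.
```

An agent that sees only the schema has no signal telling it which of the two the business means, so it picks one and answers confidently either way.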
  10. Why the blind spot?
     1. ✗ No business vocabulary: the agent sees column names and data types. It has no access to your business terminology, i.e. what 'premium' or 'active' means in your organisation. It guesses, sometimes convincingly.
     2. ✗ No metric definitions: revenue could be totalprice, extendedprice, or extendedprice × (1 − discount). The agent has no way to know which formula your business uses, so it invents one that looks plausible.
     3. ✗ No relationship awareness: multi-table joins require domain knowledge. Customer → Orders → Lineitem → Supplier is not derivable from the schema alone. Wrong join path, wrong answer, every time.
     Agents need more than table schemas: they need BUSINESS CONTEXT, and it is missing.
  11. How to fix the blind spot? A unified semantic layer exposed as an API to the LLM
     "The contract between your organisational knowledge and your agents"
     - What is it? A unified framework in which metadata, business glossaries, ontologies, knowledge graphs, and metric definitions are organised and abstracted, giving machines a way to understand data in context, not just access it.
     - ⚡ How does it help? When it is exposed as an API via MCP, agents call meaning instead of guessing from the raw schema. For defined metrics, the semantic layer can even execute and return a trusted answer directly, leaving just the synthesis work to the agent.
     - 🏗 How to get there? Most organisations already have fragments of the semantic layer in BI tools, data catalogs, and analysts' heads. The work is formalising those fragments, defining them in the right tools, and exposing them as a callable API to AI agents through MCP.
  12. Semantic Layer - Capabilities & Examples
     | Semantic Layer Element | What does it answer?      | Example                                                           |
     | Metadata               | What is this field?       | Orders.totalPrice: decimal, USD, daily refresh, owner: Finance    |
     | Glossary/Taxonomy      | What does this term mean? | "Premium Customer" = account_balance >= $7000 & revenue >= $300000 |
     | Ontology               | How do concepts relate?   | Customer places Order                                             |
     | Knowledge Graph        | What are the actual facts?| Bob placed Order#123                                              |
     | Metrics/KPIs           | What & how do we measure? | Revenue, gross margin                                             |
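A context-assisted semantic layer can be sketched as a pair of agent-callable lookup tools. The glossary entry and metric formula mirror the table's examples; the field names, thresholds, and SQL fragment are illustrative, not any real product's schema.

```python
# Hypothetical governed glossary: business terms resolved to definitions.
GLOSSARY = {
    "premium customer": "account_balance >= 7000 AND revenue >= 300000",
}

# Hypothetical governed metrics: trusted formulas with ownership metadata.
METRICS = {
    "revenue": {
        "sql": "SUM(l_extendedprice * (1 - l_discount))",
        "owner": "Finance",
    },
}

def define_term(term):
    """Agent-callable tool: resolve a business term to its governed definition."""
    return GLOSSARY[term.lower()]

def get_metric(name):
    """Agent-callable tool: return the trusted formula for a metric."""
    return METRICS[name.lower()]
```

With these two calls available over MCP, the agent no longer guesses what "premium" means or which revenue formula to use; it still writes the SQL, but from grounded definitions.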
  13. Can we do even better? Intelligent Data Platform: powered by the semantic layer, MCP & agents
     - RAW DATA ACCESS: the agent plays the guessing game; potentially wrong join paths; ambiguous formulas & terms
     - CONTEXT ASSISTED: the agent gets context in the form of metadata, formulas & terms; smarter, but the agent still writes SQL and can get it wrong
     - EXECUTABLE SEMANTIC LAYER: the agent gets the metric values; the semantic layer executes the SQL; cheaper (fewer tokens), faster, consistent, deterministic, secure with access control
     How a strong semantic layer makes AI agents trustworthy.
     Scan & register here: API Days Singapore, 14th/15th April 2026.
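The third tier can be sketched as a tool that executes the governed metric itself and hands the agent a finished number. The in-memory rows stand in for Trino over Iceberg, and the metric name and row values are illustrative.

```python
# Hypothetical line-item rows standing in for an Iceberg table queried via Trino.
LINEITEM = [
    {"l_extendedprice": 1000.0, "l_discount": 0.25},
    {"l_extendedprice": 500.0, "l_discount": 0.0},
]

def run_metric(name):
    """Agent-callable tool: compute a governed metric, deterministic for every caller."""
    if name.lower() == "revenue":
        # Governed definition: SUM(l_extendedprice * (1 - l_discount))
        return sum(r["l_extendedprice"] * (1 - r["l_discount"]) for r in LINEITEM)
    raise KeyError(f"no governed metric named {name!r}")
```

Because the agent receives only the value, there is no SQL for it to get wrong, the token cost drops to a single number, and every caller gets the same deterministic answer under the layer's access control.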
  14. Q&A