


Natural Language Analytics on the Open Lakehouse: Why is it harder than it looks? Meetup talk from the Singapore Apache Kafka meetup, April 10th 2026

A talk that explains natural language analytics on the Open Lakehouse and compares the naive approach of connecting agents directly to the data layer with connecting them through a semantic layer. A demo shows both the impressive results and the blind spots when an agent connects directly to the data layer, and how adding a semantic layer fixes those blind spots. The talk uses a context-assisted semantic layer approach.


Zabeer Farook

April 14, 2026


Transcript

  1. HELLO !! I'm Zabeer Farook, Principal Engineering Architect @ Credit Agricole CIB.
     - Passionate about data architecture, including stream data processing & event-driven architecture, as well as cloud & DevOps.
     - Love travelling & exploring places.
  2. Agenda
     01 Natural Language Analytics on the Lakehouse
     02 Open Lakehouse: The De Facto Standard
     03 How does it work in a Lakehouse?
     04 The impressive side of things (with demo)
     05 The blind spot & Why
     06 How to fix the blind spot? (with demo)
     07 Q&A
  3. Open Lakehouse - The De Facto Standard
     - Open table formats: uses formats like Iceberg, Delta Lake, Hudi
     - Performance & flexibility: combines the best of data warehouses (performance) and data lakes (flexibility)
     - Transactional guarantees: provides full ACID (Atomicity, Consistency, Isolation, Durability) transaction guarantees
     - Interoperability: works with different engines like Spark, Flink, Trino
     - Cost efficiency & scalability: leverages object storage like S3
     - No vendor lock-in: open architecture
     Modern data platforms are increasingly being built with the lakehouse architecture and open table formats like Iceberg.
  4. Open Lakehouse Architecture: Ingest, Store & Process, Serve, Consume
     - Ingestion layer: real-time ingestion from sources like Kafka plus batch ingestion, taking in structured, semi-structured & unstructured data
     - Storage layer: object storage like MinIO, AWS S3, GCS; the data itself is stored in formats like Parquet, Avro, or ORC
     - Metadata layer: open table formats like Iceberg and a catalog (e.g. a REST catalog) to manage metadata files, with data governance, indexing, and data management
     - Processing layer: raw data is further processed into cleansed data in batch or streaming mode with engines like Spark and Flink
     - Serving layer: query engines like Trino providing query and API capabilities to the consumption layer
     - Consumption layer: AI/ML & data science, batch & real-time analytics, reports & BI
  5. Natural Language Analytics on the Lakehouse
     "Can I ask questions on my data in natural language?"
     - "Show me revenue trends by region for last quarter"
     - "Why did sales drop in APAC last month?"
     - "Which customers are at risk of churning?"
     - "What's our gross margin by product line?"
     AI agents + MCP finally make this technically feasible: any LLM can discover your lakehouse tools, plan a query strategy, execute SQL, and synthesise results, in natural language, in seconds.
  6. How does it work in a Lakehouse? (The naive approach)
     User / Application ("Show me revenue trends by region for last quarter")
     → AI Agent (Claude / GPT / any LLM): plans, reasons, calls tools via MCP, synthesises results
     → MCP server on the lakehouse, exposing tools: getSchema(), listTables(), describeTable(), executeQuery()
     → Trino (or other query engines): SQL execution
     → Lakehouse with Apache Iceberg (or other table formats): storage
     MCP = an open API for AI agents: an open standard from Anthropic that works with any LLM.
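The naive tool surface above can be sketched in a few lines. This is a minimal illustration, not a real MCP server: the tool names follow the slide, while the in-memory catalog and the echoed query result are stand-ins for Trino and Iceberg.

```python
# Hypothetical in-memory "catalog" standing in for Trino/Iceberg metadata.
CATALOG = {
    "orders": ["orderkey", "custkey", "totalprice", "orderdate"],
    "customer": ["custkey", "name", "acctbal", "nationkey"],
}

def list_tables():
    return sorted(CATALOG)

def describe_table(table):
    return CATALOG[table]

def execute_query(sql):
    # A real server would forward the SQL to Trino; here we only echo it back.
    return {"engine": "trino", "sql": sql, "rows": []}

# The agent discovers tools by name and invokes them, JSON-RPC style.
TOOLS = {
    "listTables": list_tables,
    "describeTable": describe_table,
    "executeQuery": execute_query,
}

def call_tool(name, *args):
    return TOOLS[name](*args)
```

Note that every tool here exposes only schema and raw data, which is exactly why the blind spot discussed later appears: nothing in this surface carries business meaning.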
  7. MCP - The bridge for agents to access external tools
     MCP (Model Context Protocol) is an open, standardised interface that enables LLMs to:
     - Interact seamlessly and securely with external systems, APIs, and data sources
     - Gain agentic AI capabilities, making them intelligent operators that can take action based on natural-language input
     Communication protocol: JSON-RPC. Transport: stdio / SSE / streamable HTTP.
     Without a standard communication & data-exchange protocol between AI systems and external tools, integrating different tools with multiple AI systems poses complex integration challenges and friction. MCP is the "USB-C for AI agents".
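The JSON-RPC framing mentioned above looks roughly like this on the wire. The method names `tools/list` and `tools/call` come from the MCP specification; the tool name `executeQuery` and its SQL argument are illustrative.

```python
import json

# JSON-RPC 2.0 request asking the server which tools it exposes.
list_request = {"jsonrpc": "2.0", "id": 1, "method": "tools/list"}

# JSON-RPC 2.0 request invoking one tool with arguments.
call_request = {
    "jsonrpc": "2.0",
    "id": 2,
    "method": "tools/call",
    "params": {
        "name": "executeQuery",          # illustrative tool name
        "arguments": {"sql": "SELECT count(*) FROM orders"},
    },
}

# What actually travels over stdio / SSE / streamable HTTP is the JSON text.
wire = json.dumps(call_request)
```

The transport only moves these envelopes back and forth; the intelligence stays in the agent, which decides which tool to call next based on each response.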
  8. The impressive side of things: the agent connects directly to the Lakehouse
     Let's see it in action with an Iceberg lakehouse holding the TPC-H dataset (Customer, Orders, LineItem, Supplier, Part, PartSupplier, Region, Nation).
     - Q1, schema-level question: "What are the available schemas?"
     - Q2, table-level question: "List the available tables"
     - Q3, column-related question: "What columns does the customer table have?"
     - Q4, simple query on data: "How many orders were placed in Q1?"
     - Q5, aggregated query on data: "Show top 5 nations by total order value"
  9. The blind spot: same setup, but different questions this time
     "Show me Q1 revenue for premium customers"
     ✗ Ambiguity in formula: is revenue totalPrice, extendedPrice, or extendedPrice * (1 - discount)?
     ✗ Ambiguous business term: the agent guesses the definition of "premium customers"
     ✗ Potentially wrong join path: the agent might use the wrong join path
     ✗ Confident wrong answer: it might end up giving incorrect values while sounding confident
     Let's see it in action with questions involving some business terms.
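The formula ambiguity is easy to see with toy numbers. Both of the sums below are plausible interpretations of "revenue" over the same line items; the row values are made up for illustration, and only one formula matches the TPC-H convention of `extendedprice * (1 - discount)`.

```python
# Toy line items as (extendedprice, discount) pairs; the numbers are made up.
line_items = [(1000.0, 0.25), (500.0, 0.00), (2000.0, 0.50)]

# Guess 1: ignore the discount entirely.
naive_revenue = sum(ep for ep, _ in line_items)

# Guess 2: the TPC-H convention, extendedprice * (1 - discount).
tpch_revenue = sum(ep * (1 - d) for ep, d in line_items)

# 3500.0 vs 2250.0: both look plausible, and only one matches Finance's number.
```

An agent that sees only the schema has no signal telling it which of the two the business means, so it picks one and answers confidently either way.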
  10. Why the blind spot?
     1. ✗ No business vocabulary: the agent sees column names and data types. It has no access to your business terminology, i.e. what 'premium' or 'active' means in your organisation. It guesses, sometimes convincingly.
     2. ✗ No metric definitions: revenue could be totalprice, extendedprice, or extendedprice × (1 − discount). The agent has no way to know which formula your business uses, so it invents one that looks plausible.
     3. ✗ No relationship awareness: multi-table joins require domain knowledge. Customer → Orders → Lineitem → Supplier is not derivable from the schema alone. Wrong join path, wrong answer, every time.
     Agents need more than table schemas: they need BUSINESS CONTEXT, and it is missing.
  11. How to fix the blind spot? A unified semantic layer exposed as an API to the LLM
     "The contract between your organisational knowledge and your agents"
     - What is it? A unified framework in which metadata, business glossaries, ontologies, knowledge graphs, and metric definitions are organised and abstracted, giving machines a way to understand data in context, not just access it.
     - ⚡ How does it help? When it is exposed as an API via MCP, agents call meaning instead of guessing from the raw schema. For defined metrics, the semantic layer can even execute and return a trusted answer directly, leaving just the synthesis work to the agent.
     - 🏗 How to get there? Most organisations already have fragments of the semantic layer in BI tools, data catalogs, and analysts' heads. The work is formalising those fragments, defining them in the right tools, and exposing them as a callable API to AI agents through MCP.
  12. Semantic Layer - Capabilities & Examples
     | Semantic Layer Element | What does it answer?      | Example                                                           |
     | Metadata               | What is this field?       | Orders.totalPrice: decimal, USD, daily refresh, owner: Finance    |
     | Glossary/Taxonomy      | What does this term mean? | "Premium Customer" = account_balance >= $7000 & revenue >= $300000 |
     | Ontology               | How do concepts relate?   | Customer places Order                                             |
     | Knowledge Graph        | What are the actual facts?| Bob placed Order#123                                              |
     | Metrics/KPIs           | What & how do we measure? | Revenue, gross margin                                             |
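A context-assisted semantic layer can be sketched as a pair of agent-callable lookup tools. The glossary entry and metric formula mirror the table's examples; the field names, thresholds, and SQL fragment are illustrative, not any real product's schema.

```python
# Hypothetical governed glossary: business terms resolved to definitions.
GLOSSARY = {
    "premium customer": "account_balance >= 7000 AND revenue >= 300000",
}

# Hypothetical governed metrics: trusted formulas with ownership metadata.
METRICS = {
    "revenue": {
        "sql": "SUM(l_extendedprice * (1 - l_discount))",
        "owner": "Finance",
    },
}

def define_term(term):
    """Agent-callable tool: resolve a business term to its governed definition."""
    return GLOSSARY[term.lower()]

def get_metric(name):
    """Agent-callable tool: return the trusted formula for a metric."""
    return METRICS[name.lower()]
```

With these two calls available over MCP, the agent no longer guesses what "premium" means or which revenue formula to use; it still writes the SQL, but from grounded definitions.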
  13. Can we do even better? Intelligent Data Platform: powered by the semantic layer, MCP & agents
     - RAW DATA ACCESS: the agent plays the guessing game; potentially wrong join paths; ambiguous formulas & terms
     - CONTEXT ASSISTED: the agent gets context in the form of metadata, formulas & terms; smarter, but the agent still writes SQL and can get it wrong
     - EXECUTABLE SEMANTIC LAYER: the agent gets the metric values; the semantic layer executes the SQL; cheaper (fewer tokens), faster, consistent, deterministic, secure with access control
     How a strong semantic layer makes AI agents trustworthy.
     Scan & register here: API Days Singapore, 14th/15th April 2026.
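The third tier can be sketched as a tool that executes the governed metric itself and hands the agent a finished number. The in-memory rows stand in for Trino over Iceberg, and the metric name and row values are illustrative.

```python
# Hypothetical line-item rows standing in for an Iceberg table queried via Trino.
LINEITEM = [
    {"l_extendedprice": 1000.0, "l_discount": 0.25},
    {"l_extendedprice": 500.0, "l_discount": 0.0},
]

def run_metric(name):
    """Agent-callable tool: compute a governed metric, deterministic for every caller."""
    if name.lower() == "revenue":
        # Governed definition: SUM(l_extendedprice * (1 - l_discount))
        return sum(r["l_extendedprice"] * (1 - r["l_discount"]) for r in LINEITEM)
    raise KeyError(f"no governed metric named {name!r}")
```

Because the agent receives only the value, there is no SQL for it to get wrong, the token cost drops to a single number, and every caller gets the same deterministic answer under the layer's access control.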
  14. Q&A