
Exploring AI-native Directions for Lakehouse


Open Data Circle

January 27, 2026


Transcript

  1. Exploring AI-native Directions for Lakehouse
     From today's leading solutions to tomorrow's architectural roadmap
     Xiao Zhiyan – Open Data Circle – January 27th, 2026
  2. Xiao Zhiyan (@xiaozhiyan)
     • Chongqing, China – "8D Magic City", hot pot
     • The Chinese University of Hong Kong – Mathematics and Information Engineering
     • LY Corporation – Software Engineer: streaming data pipelines, Spark, Iceberg
     • Open Data Circle – Big Data x Rust, AI-native, Streamhouse
     • Chongqing → Hong Kong → Tokyo
  3. What This Talk Is (and Is Not)
     What this talk is not:
     • Not a product pitch
     • Not a catalog of tools or features
     • Not a fixed architecture proposal
     What this talk is:
     • A practical exploration of AI-native directions for lakehouse
     • Grounded in real systems and production constraints
     • Focused on patterns, trade-offs, and evolution paths
  4. Why AI Changes the Default Requirements
     The pressure is already here:
     • RAG / semantic search / AI agents are moving from demos to production
     • "Freshness" and "retrieval latency" are now product requirements
     • Batch-only assumptions start to break (e.g. user-facing AI)
     What AI workloads demand:
     • Random access + low-latency reads (not just scans)
     • Index-aware data layout: vectors / embeddings / multimodal
     • Continuous updates + lifecycle ops (re-embed / re-index)
  5. The Mismatch with Traditional Lakehouse Assumptions
     Traditional lakehouse assumptions:
     • Scan-heavy, batch-oriented access patterns
     • Indexing is optional, external, or delayed
     • Metadata focuses on snapshots, not semantics
     • Maintenance is offline and human-triggered
     AI workload reality:
     • Retrieval-heavy, latency-sensitive access
     • Index is part of the data, not an afterthought
     • Semantics (embeddings, etc.) are first-class
     • Maintenance must be continuous and automated
  6. Two Common Reactions (and Why They Fall Short)
     Reaction 1: Add more features
     • Add vector search
     • Add another service
     • Add another pipeline
     → Complexity grows faster than capability
     Reaction 2: Jump to a brand-new system
     • Rewrite everything
     • Bet on a single new stack
     → High risk, low adoption
  7. The Core Question We'll Answer Today
     "How are today's leading systems actually evolving under AI and streaming pressure?"
     What we'll do next:
     • Look at representative solutions (production & frontier)
     • Treat them as coordinate points, not winners
     • Connect them into a coherent architectural map
     Takeaway:
     • A framework you can reuse internally
     • A roadmap you can start from tomorrow
  8. Landscape Overview: From Points to a Map
     Two dimensions:
     • Freshness & change propagation (batch → streaming)
     • Access & indexing model (scan-based → index/vector-aware)
     How to read this landscape:
     • Not a ranking
     • Not a recommendation
     • Coordinate points to reason about evolution paths under AI pressure
  9. #1 Open Table Formats (Baseline)
     Representative: Iceberg / Delta / Hudi
     Underlying assumptions:
     • Analytics-first (scan-based)
     • Mostly append-heavy data
     • Batch or micro-batch processing
     What they solve well:
     • ACID transactions & snapshot isolation
     • Schema evolution & time travel
     • Metadata-driven architecture (separate storage & compute)
     What starts to break under AI:
     • Index lifecycle stays external to the table
     • Random access becomes inefficient
     • Continuous updates are not first-class
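To ground the baseline, here is a minimal PyIceberg sketch (the catalog name, table name, and filter are illustrative placeholders) of the scan-oriented, snapshot-based access model these formats are built around:

```python
from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")            # catalog config comes from .pyiceberg.yaml / env
table = catalog.load_table("db.events")

# Scan-based read: the access pattern open table formats are optimized for.
batch = table.scan(row_filter="event_date >= '2026-01-01'").to_arrow()

# Time travel: re-read the table as of an earlier snapshot.
snapshots = list(table.snapshots())
if len(snapshots) > 1:
    earlier = table.scan(snapshot_id=snapshots[-2].snapshot_id).to_arrow()

# Note what is missing: no point-lookup or vector-index path here;
# random access and index lifecycle live outside the table format.
```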
  10. #2 Lakehouse + External Vector DB (Pattern)
     Typical pattern:
     • Iceberg / Delta + Milvus / OpenSearch / Elastic
     • Embeddings generated & synced externally
     Why it works:
     • Fast to prototype
     • Flexible ecosystem
     • Production-proven today
     Hidden cost:
     • Consistency & sync pipelines
     • Re-embedding & re-indexing over time
     • Reproducibility becomes hard (time travel vs. index state)
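A rough sketch of the sync loop this pattern implies, assuming PyIceberg on the lake side and pymilvus (`MilvusClient`) on the index side; the table, collection, filter, and `embed()` function are hypothetical placeholders:

```python
from pyiceberg.catalog import load_catalog
from pymilvus import MilvusClient

def embed(texts):
    """Placeholder for the embedding model call (hosted API, local model, etc.)."""
    raise NotImplementedError

# 1. Pull recently changed rows from the lake table.
catalog = load_catalog("default")
table = catalog.load_table("db.documents")
rows = table.scan(row_filter="updated_at >= '2026-01-26'").to_arrow().to_pylist()

# 2. Re-embed and upsert into the external vector database.
client = MilvusClient(uri="http://localhost:19530")
vectors = embed([r["text"] for r in rows])
client.upsert(
    collection_name="docs_index",
    data=[{"id": r["id"], "vector": v, "text": r["text"]} for r, v in zip(rows, vectors)],
)

# You own everything around this loop: scheduling, retries, backfills, and
# answering "which table snapshot does the index actually reflect?"
```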
  11. The Real Cost: Index Lifecycle
     What you must operate:
     • Sync & refresh: keep vectors aligned with table updates (CDC / batch / hybrid)
     • Re-embed: data changes, model upgrades, prompt changes → embeddings must be regenerated
     • Re-index: incremental updates vs. full rebuild; compaction/merge effects
     • Rollback & time travel: can you reproduce results from an earlier snapshot?
     • Drift management: data/model/version drift becomes a continuous problem
     Why this becomes painful:
     • Complexity scales with (data change rate × model iteration rate × index types)
     • Most "AI-ready" incidents are lifecycle issues, not vector search queries
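One way to see why this is a control loop rather than a query feature: a hypothetical refresh policy (all names and thresholds are illustrative) that has to weigh data change rate against model change rate:

```python
# Hypothetical re-embed/re-index policy check, illustrating that the cost is a
# continuous control loop (data changes x model changes), not a query feature.
from dataclasses import dataclass

@dataclass
class IndexState:
    model_version: str        # embedding model the index was built with
    indexed_snapshot_id: int  # lake snapshot the index was built from

def plan_refresh(state: IndexState, current_model: str,
                 current_snapshot_id: int, changed_fraction: float) -> str:
    if state.model_version != current_model:
        return "full re-embed + full re-index"      # model upgrade invalidates all vectors
    if changed_fraction > 0.2:
        return "full re-index"                      # too many updates for incremental merge
    if current_snapshot_id != state.indexed_snapshot_id:
        return "incremental re-embed + upsert"      # normal CDC-driven refresh
    return "no-op"
```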
  12. #3 Integrated Vector Search (Platformized)
     • Most "AI-ready" failures are lifecycle issues (refresh / drift / rollback), not query syntax
     • Platformization is the industry's response: "we manage the loop for you"
     Representative signals:
     • Databricks: Delta + Vector Search (managed sync index)
     • Snowflake: Cortex Search (hybrid search)
     • BigQuery: Vector Search + Vector Index (managed refresh)
     Why it matters:
     • Vector index becomes a managed capability
     • Index lifecycle is absorbed: build / refresh / incremental sync
     • Governance is integrated: ACL, lineage, monitoring
     Trade-off vs. "external vector DB":
     • Less flexibility (you follow the platform's knobs)
     • Strong coupling to one ecosystem (data, compute, serving)
  13. #4 — AI-native Table Format (Lance)
     Representative: Lance
     Core idea:
     • Vectors & multimodal data are first-class citizens
     • Index-aware + versioned + random-access friendly
     What it enables (in practice):
     • Fast kNN / hybrid retrieval close to the dataset
     • Efficient reads beyond scans (random access patterns)
     • Dataset-level reproducibility with versions + index metadata
     What's fundamentally different:
     • Index lifecycle is treated as part of the dataset story (not "an external sync pipeline you own")
     • Designed for AI/ML access patterns from day one (embeddings, reranking, multimodal features)
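A minimal sketch with the Lance Python package (pylance), assuming illustrative paths, dimensions, and index parameters, showing what "index-aware + versioned + random-access friendly" looks like in practice:

```python
import lance
import numpy as np
import pyarrow as pa

dim = 128
table = pa.table({
    "id": pa.array(range(1_000)),
    "text": pa.array([f"doc {i}" for i in range(1_000)]),
    "vector": pa.array(np.random.rand(1_000, dim).tolist(), type=pa.list_(pa.float32(), dim)),
})

# Writes are versioned: every commit is a new, readable dataset version.
ds = lance.write_dataset(table, "/tmp/docs.lance")

# The ANN index is part of the dataset, not an external service to sync.
ds.create_index("vector", index_type="IVF_PQ", num_partitions=8, num_sub_vectors=16)

# kNN / random-access reads close to the data.
query = np.random.rand(dim).astype("float32")
hits = ds.to_table(nearest={"column": "vector", "q": query, "k": 5})

# Reproducibility: reopen the dataset as of an earlier version.
v1 = lance.dataset("/tmp/docs.lance", version=1)
```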
  14. #5 — Streaming-native Table (Paimon)
     Representative: Apache Paimon
     Why it matters for AI:
     • Freshness becomes a default requirement (RAG / agents / monitoring)
     • AI workloads amplify the cost of stale data and delayed propagation
     • "Index lifecycle" is meaningless if your underlying data isn't continuously updated
     • Maintenance must be automated and continuous (not a nightly job)
     What it focuses on:
     • Streaming-first ingestion
     • Upserts / changelog as first-class data
     • LSM-style storage + compaction for continuous writes
     • Efficient incremental consumption for downstream pipelines
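A hedged sketch of what streaming-first upserts look like with Apache Paimon via PyFlink SQL; the warehouse path, table names, the `source_docs` table, and the `changelog-producer` choice are illustrative, and the Paimon Flink connector jar is assumed to be on the classpath:

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Register a Paimon catalog backed by a warehouse path.
t_env.execute_sql("""
    CREATE CATALOG paimon WITH (
        'type' = 'paimon',
        'warehouse' = 'file:///tmp/paimon'
    )
""")
t_env.execute_sql("USE CATALOG paimon")

# Primary-keyed table: writes become upserts, and a changelog is produced
# for incremental downstream consumption.
t_env.execute_sql("""
    CREATE TABLE IF NOT EXISTS docs (
        id BIGINT,
        body STRING,
        updated_at TIMESTAMP(3),
        PRIMARY KEY (id) NOT ENFORCED
    ) WITH ('changelog-producer' = 'lookup')
""")

# Continuously upsert from some streaming source table (not defined here).
t_env.execute_sql("INSERT INTO docs SELECT id, body, updated_at FROM source_docs")
```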
  15. #6 — Streaming Storage (Fluss)
     Representative: Apache Fluss (incubating)
     What it enables:
     • Continuous ingestion without forcing everything into lake files immediately
     • Ordered log + retention as a first-class storage primitive
     • Continuous materialization / tiering into Iceberg / Paimon / etc.
     Positioning:
     • Streaming storage as a hot tier
     • Built for low latency and high throughput
     • Lakehouse tables become the cold / historical tier
  16. #7 — Frontier Stack: Fluss + Lance
     Clean separation of concerns:
     • Fluss as the real-time hot tier: ordered changes, retention, replay → "freshness" becomes a storage property
     • Lance as the AI-native lake tier: versioned datasets + index-aware layout → vectors / multimodal access near the data
     Why this matters:
     • Continuous tiering turns freshness + indexability into defaults (not glue pipelines)
     • Less glue: fewer sync / rebuild pipelines, fewer "who owns consistency?" debates
     • Better failure model: replay + rebuild can be reasoned about via log semantics + dataset versions
     • Faster iteration: model upgrades become policies (re-embed / re-index) instead of ad-hoc scripts
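To make the loop concrete, here is a purely conceptual sketch of the tiering step (this is not a real Fluss client API): consume ordered changes from the hot tier, re-embed them, and materialize a new Lance dataset version. `read_changes()` and `embed()` are hypothetical stand-ins.

```python
import lance
import pyarrow as pa

def read_changes(offset):
    """Stand-in for consuming ordered changes (Fluss or any log) from an offset."""
    raise NotImplementedError

def embed(texts):
    """Stand-in for the embedding model call."""
    raise NotImplementedError

def tier_once(offset: int, uri: str = "/tmp/docs.lance") -> int:
    batch = list(read_changes(offset))
    if not batch:
        return offset
    vectors = embed([c["body"] for c in batch])
    table = pa.table({
        "id": [c["id"] for c in batch],
        "body": [c["body"] for c in batch],
        "vector": vectors,
    })
    # Each append creates a new dataset version: the log offset plus the
    # dataset version gives a reproducible replay-and-rebuild path.
    lance.write_dataset(table, uri, mode="append")
    return offset + len(batch)
```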
  17. Landscape Summary — Patterns, Not Winners
     Patterns we saw:
     • Lifecycle gets productized: external sync → managed index → index-aware formats
     • Freshness shifts downward: pipeline concern → table concern → storage-tier concern
     • Systems move from "features" to "loops": embed → index → serve → monitor → re-embed/re-index becomes the new default
     How to use the patterns:
     • Don't ask "which one is best?" — ask where your pain sits (freshness, retrieval, or lifecycle cost)
     • Avoid extremes: more glue vs. a full rewrite
     • Choose an evolution path with clear trade-offs
     • Treat each coordinate as a capability package, not a brand
  18. From Points to a Path
     • Baseline: open table formats + batch-first analytics → ACID, time travel, governance (but scan-heavy)
     • AI-ready (v1): lakehouse + external vector DB → fastest adoption, but you own the lifecycle loop
     • AI-ready (v2): managed / integrated vector search → lifecycle is platformized (less glue, more coupling)
     • Streaming × AI-native direction: streaming-native ingestion + index-aware formats → freshness + retrieval + lifecycle become first-class design concerns
     Along the path:
     • Freshness becomes default: streaming moves closer to storage
     • Retrieval becomes index-aware: vector / hybrid access becomes normal
     • Lifecycle becomes managed: embed/index/refresh shifts from glue to platform
  19. Layered View — A Mental Model
     Layer 4 — AI Lifecycle (control loop):
     • Embed / re-embed policy (model & prompt upgrades)
     • Re-index / refresh orchestration
     • Feedback & eval (quality, drift, cost)
     Layer 3 — Indexing & Retrieval (serving path):
     • Vector / hybrid retrieval
     • Low-latency random access
     • Reproducibility for "what did the model see?"
     Layer 2 — Storage & Table Format (truth layer):
     • ACID + versions (time travel)
     • Upserts / compaction / clustering
     • Index-aware layout (when applicable)
     Layer 1 — Streaming & Change Propagation (freshness layer):
     • Continuous ingestion
     • Changelog as first-class signal
     • Event-time semantics & incremental consumption
  20. The Migration Path (High-Level)
     Step 0 — Baseline (Analytics-first):
     • Batch-first lakehouse tables
     • Scan-based reads, offline optimization
     • AI experiments stay "outside"
     Step 1 — AI-ready (v1 / v2):
     • External or integrated vector search
     • Embedding + sync pipelines appear
     • Lifecycle becomes a real cost center
     Step 2 — Streaming-aware:
     • Freshness becomes a default requirement
     • Upserts / changelog become first-class
     • Maintenance shifts toward automation
     Step 3 — AI-native (Loop-driven):
     • Index-aware storage & retrieval
     • Lifecycle managed as a platform loop
     • Fewer glue pipelines, more policies
  21. What Changes at Each Step (Trade-offs)
     • AI-ready (v1/v2): ✅ Fast to adopt ❌ Operational complexity grows
     • Streaming-aware: ✅ Better freshness & responsiveness ❌ Requires rethinking storage & maintenance
     • AI-native (loop-driven): ✅ Simplified lifecycle, better performance ❌ Ecosystem still evolving
  22. Practical Guidance — For Platform Teams
     Start with reality (not architecture):
     • Map where your pain actually is (freshness, latency, ops cost, reproducibility)
     • Identify who owns the index lifecycle today (scripts? teams? nobody?)
     • Measure how often data & models change
     • Accept current constraints (skills, budget, ecosystem)
     Move with direction (not perfection):
     • Reduce glue before adding features (fewer sync pipelines > faster vector search)
     • Make freshness & lifecycle explicit design goals
     • Evolve layer by layer, not by big rewrites
     • Choose components that align with your next step
  23. Practical Guidance — For Engineers & Builders
     Do (build durable skills):
     • Reason across layers: streaming → table → index → AI lifecycle
     • Make lifecycle explicit: embed / re-embed / refresh / rollback
     • Learn "why", not only "how": access pattern → storage layout → cost
     • Practice with real constraints: latency, freshness, ops, governance
     Don't (common traps):
     • Don't memorize tool stacks (they change faster than principles)
     • Don't treat the vector DB as a magic box (most issues are lifecycle)
     • Don't ignore updates/deletes (freshness & correctness will bite later)
     • Don't optimize one layer in isolation (you'll move the pain elsewhere)
  24. Bringing It All Together
     • AI changes the default requirements: random access, retrieval latency, and lifecycle become "always-on" concerns
     • The hard part is not vector search — it's the lifecycle loop: embed → index → serve → monitor → re-embed / re-index / rollback
     • There is no single winner — but there is a clear direction: streaming-first freshness + index-aware access + loop-driven operations
  25. Why We're Exploring This Together — Open Data Circle (ODC)
     What ODC is:
     • A community for builders of modern data & AI platforms
     • A place to share frameworks, trade-offs, and real lessons
     • A space to co-create: learn → test → refine → share
     What ODC is not:
     • Not a vendor roadmap
     • Not a "tool demo club"
     • Not a one-way broadcast
     What's next:
     • More meetups around Lakehouse × AI × Streaming
     • More community experiments (RAG demos, format deep dives, benchmarks)
     • More shared artifacts: slides, code, references
     How to join:
     • Bring a problem, a hypothesis, or a prototype
     • Ask hard questions, share honest constraints
     • Help us turn "trends" into usable knowledge