
Exploring AI-native Directions for Lakehouse


Open Data Circle

January 27, 2026


Transcript

  1. Exploring AI-native Directions for Lakehouse
     From today's leading solutions to tomorrow's architectural roadmap
     Xiao Zhiyan – Open Data Circle – January 27th, 2026
  2. Xiao Zhiyan (@xiaozhiyan)
     • Chongqing, China – "8D Magic City", hot pot
     • The Chinese University of Hong Kong – Mathematics and Information Engineering
     • LY Corporation – Software Engineer: streaming data pipelines, Spark, Iceberg
     • Open Data Circle – Big Data x Rust, AI-native, Streamhouse
     • Chongqing → Hong Kong → Tokyo
  3. What This Talk Is (and Is Not)
     What this talk is not:
     • Not a product pitch
     • Not a catalog of tools or features
     • Not a fixed architecture proposal
     What this talk is:
     • A practical exploration of AI-native directions for lakehouse
     • Grounded in real systems and production constraints
     • Focused on patterns, trade-offs, and evolution paths
  4. Why AI Changes the Default Requirements
     The pressure is already here:
     • RAG / semantic search / AI agents are moving from demos to production
     • "Freshness" and "retrieval latency" are now product requirements
     • Batch-only assumptions start to break (e.g. user-facing AI)
     What AI workloads demand:
     • Random access + low-latency reads (not just scans)
     • Index-aware data layout: vectors / embeddings / multimodal
     • Continuous updates + lifecycle ops (re-embed / re-index)
  5. The Mismatch with Traditional Lakehouse Assumptions
     Traditional lakehouse assumptions:
     • Scan-heavy, batch-oriented access patterns
     • Indexing is optional, external, or delayed
     • Metadata focuses on snapshots, not semantics
     • Maintenance is offline and human-triggered
     AI workload reality:
     • Retrieval-heavy, latency-sensitive access
     • Index is part of the data, not an afterthought
     • Semantics (embeddings, etc.) are first-class
     • Maintenance must be continuous and automated
  6. Two Common Reactions (and Why They Fall Short)
     Reaction 1: Add more features
     • Add vector search
     • Add another service
     • Add another pipeline
     → Complexity grows faster than capability
     Reaction 2: Jump to a brand-new system
     • Rewrite everything
     • Bet on a single new stack
     → High risk, low adoption
  7. The Core Question We'll Answer Today
     "How are today's leading systems actually evolving under AI and streaming pressure?"
     What we'll do next:
     • Look at representative solutions (production & frontier)
     • Treat them as coordinate points, not winners
     • Connect them into a coherent architectural map
     Takeaway:
     • A framework you can reuse internally
     • A roadmap you can start from tomorrow
  8. Landscape Overview: From Points to a Map
     Two dimensions:
     • Freshness & change propagation (batch → streaming)
     • Access & indexing model (scan-based → index/vector-aware)
     How to read this landscape:
     • Not a ranking
     • Not a recommendation
     • Coordinate points to reason about evolution paths under AI pressure
  9. #1 Open Table Formats (Baseline)
     Representative: Iceberg / Delta / Hudi
     Underlying assumptions:
     • Analytics-first (scan-based)
     • Mostly append-heavy data
     • Batch or micro-batch processing
     What they solve well:
     • ACID transactions & snapshot isolation
     • Schema evolution & time travel
     • Metadata-driven architecture (separate storage & compute)
     What starts to break under AI:
     • Index lifecycle stays external to the table
     • Random access becomes inefficient
     • Continuous updates are not first-class
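To ground the baseline, here is a minimal PyIceberg sketch (the catalog name, table name, and filter are illustrative placeholders) of the scan-oriented, snapshot-based access model these formats are built around:

```python
from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")            # catalog config comes from .pyiceberg.yaml / env
table = catalog.load_table("db.events")

# Scan-based read: the access pattern open table formats are optimized for.
batch = table.scan(row_filter="event_date >= '2026-01-01'").to_arrow()

# Time travel: re-read the table as of an earlier snapshot.
snapshots = list(table.snapshots())
if len(snapshots) > 1:
    earlier = table.scan(snapshot_id=snapshots[-2].snapshot_id).to_arrow()

# Note what is missing: no point-lookup or vector-index path here;
# random access and index lifecycle live outside the table format.
```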
  10. #2 Lakehouse + External Vector DB (Pattern)
     Typical pattern:
     • Iceberg / Delta + Milvus / OpenSearch / Elastic
     • Embeddings generated & synced externally
     Why it works:
     • Fast to prototype
     • Flexible ecosystem
     • Production-proven today
     Hidden cost:
     • Consistency & sync pipelines
     • Re-embedding & re-indexing over time
     • Reproducibility becomes hard (time travel vs. index state)
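A rough sketch of the sync loop this pattern implies, assuming PyIceberg on the lake side and pymilvus (`MilvusClient`) on the index side; the table, collection, filter, and `embed()` function are hypothetical placeholders:

```python
from pyiceberg.catalog import load_catalog
from pymilvus import MilvusClient

def embed(texts):
    """Placeholder for the embedding model call (hosted API, local model, etc.)."""
    raise NotImplementedError

# 1. Pull recently changed rows from the lake table.
catalog = load_catalog("default")
table = catalog.load_table("db.documents")
rows = table.scan(row_filter="updated_at >= '2026-01-26'").to_arrow().to_pylist()

# 2. Re-embed and upsert into the external vector database.
client = MilvusClient(uri="http://localhost:19530")
vectors = embed([r["text"] for r in rows])
client.upsert(
    collection_name="docs_index",
    data=[{"id": r["id"], "vector": v, "text": r["text"]} for r, v in zip(rows, vectors)],
)

# You own everything around this loop: scheduling, retries, backfills, and
# answering "which table snapshot does the index actually reflect?"
```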
  11. The Real Cost: Index Lifecycle
     What you must operate:
     • Sync & refresh: keep vectors aligned with table updates (CDC / batch / hybrid)
     • Re-embed: data changes, model upgrades, prompt changes → embeddings must be regenerated
     • Re-index: incremental updates vs. full rebuild; compaction/merge effects
     • Rollback & time travel: can you reproduce results from an earlier snapshot?
     • Drift management: data/model/version drift becomes a continuous problem
     Why this becomes painful:
     • Complexity scales with (data change rate × model iteration rate × index types)
     • Most "AI-ready" incidents are lifecycle issues, not vector search queries
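One way to see why this is a control loop rather than a query feature: a hypothetical refresh policy (all names and thresholds are illustrative) that has to weigh data change rate against model change rate:

```python
# Hypothetical re-embed/re-index policy check, illustrating that the cost is a
# continuous control loop (data changes x model changes), not a query feature.
from dataclasses import dataclass

@dataclass
class IndexState:
    model_version: str        # embedding model the index was built with
    indexed_snapshot_id: int  # lake snapshot the index was built from

def plan_refresh(state: IndexState, current_model: str,
                 current_snapshot_id: int, changed_fraction: float) -> str:
    if state.model_version != current_model:
        return "full re-embed + full re-index"      # model upgrade invalidates all vectors
    if changed_fraction > 0.2:
        return "full re-index"                      # too many updates for incremental merge
    if current_snapshot_id != state.indexed_snapshot_id:
        return "incremental re-embed + upsert"      # normal CDC-driven refresh
    return "no-op"
```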
  12. #3 Integrated Vector Search (Platformized)
     • Most "AI-ready" failures are lifecycle issues (refresh / drift / rollback), not query syntax
     • Platformization is the industry's response: "we manage the loop for you"
     Representative signals:
     • Databricks: Delta + Vector Search (managed sync index)
     • Snowflake: Cortex Search (hybrid search)
     • BigQuery: Vector Search + Vector Index (managed refresh)
     Why it matters:
     • Vector index becomes a managed capability
     • Index lifecycle is absorbed: build / refresh / incremental sync
     • Governance is integrated: ACL, lineage, monitoring
     Trade-off vs. "external vector DB":
     • Less flexibility (you follow the platform's knobs)
     • Strong coupling to one ecosystem (data, compute, serving)
  13. #4 — AI-native Table Format (Lance)
     Representative: Lance
     Core idea:
     • Vectors & multimodal data are first-class citizens
     • Index-aware + versioned + random-access friendly
     What it enables (in practice):
     • Fast kNN / hybrid retrieval close to the dataset
     • Efficient reads beyond scans (random access patterns)
     • Dataset-level reproducibility with versions + index metadata
     What's fundamentally different:
     • Index lifecycle is treated as part of the dataset story (not "an external sync pipeline you own")
     • Designed for AI/ML access patterns from day one (embeddings, reranking, multimodal features)
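A minimal sketch with the Lance Python package (pylance), assuming illustrative paths, dimensions, and index parameters, showing what "index-aware + versioned + random-access friendly" looks like in practice:

```python
import lance
import numpy as np
import pyarrow as pa

dim = 128
table = pa.table({
    "id": pa.array(range(1_000)),
    "text": pa.array([f"doc {i}" for i in range(1_000)]),
    "vector": pa.array(np.random.rand(1_000, dim).tolist(), type=pa.list_(pa.float32(), dim)),
})

# Writes are versioned: every commit is a new, readable dataset version.
ds = lance.write_dataset(table, "/tmp/docs.lance")

# The ANN index is part of the dataset, not an external service to sync.
ds.create_index("vector", index_type="IVF_PQ", num_partitions=8, num_sub_vectors=16)

# kNN / random-access reads close to the data.
query = np.random.rand(dim).astype("float32")
hits = ds.to_table(nearest={"column": "vector", "q": query, "k": 5})

# Reproducibility: reopen the dataset as of an earlier version.
v1 = lance.dataset("/tmp/docs.lance", version=1)
```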
  14. #5 — Streaming-native Table (Paimon)
     Representative: Apache Paimon
     Why it matters for AI:
     • Freshness becomes a default requirement (RAG / agents / monitoring)
     • AI workloads amplify the cost of stale data and delayed propagation
     • "Index lifecycle" is meaningless if your underlying data isn't continuously updated
     • Maintenance must be automated and continuous (not a nightly job)
     What it focuses on:
     • Streaming-first ingestion
     • Upserts / changelog as first-class data
     • LSM-style storage + compaction for continuous writes
     • Efficient incremental consumption for downstream pipelines
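A hedged sketch of what streaming-first upserts look like with Apache Paimon via PyFlink SQL; the warehouse path, table names, the `source_docs` table, and the `changelog-producer` choice are illustrative, and the Paimon Flink connector jar is assumed to be on the classpath:

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Register a Paimon catalog backed by a warehouse path.
t_env.execute_sql("""
    CREATE CATALOG paimon WITH (
        'type' = 'paimon',
        'warehouse' = 'file:///tmp/paimon'
    )
""")
t_env.execute_sql("USE CATALOG paimon")

# Primary-keyed table: writes become upserts, and a changelog is produced
# for incremental downstream consumption.
t_env.execute_sql("""
    CREATE TABLE IF NOT EXISTS docs (
        id BIGINT,
        body STRING,
        updated_at TIMESTAMP(3),
        PRIMARY KEY (id) NOT ENFORCED
    ) WITH ('changelog-producer' = 'lookup')
""")

# Continuously upsert from some streaming source table (not defined here).
t_env.execute_sql("INSERT INTO docs SELECT id, body, updated_at FROM source_docs")
```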
  15. #6 — Streaming Storage (Fluss)
     Representative: Apache Fluss (incubating)
     What it enables:
     • Continuous ingestion without forcing everything into lake files immediately
     • Ordered log + retention as a first-class storage primitive
     • Continuous materialization / tiering into Iceberg / Paimon / etc.
     Positioning:
     • Streaming storage as a hot tier
     • Built for low latency and high throughput
     • Lakehouse tables become the cold / historical tier
  16. #7 — Frontier Stack: Fluss + Lance
     Clean separation of concerns:
     • Fluss as the real-time hot tier: ordered changes, retention, replay → "freshness" becomes a storage property
     • Lance as the AI-native lake tier: versioned datasets + index-aware layout → vectors / multimodal access near the data
     Why this matters:
     • Continuous tiering turns freshness + indexability into defaults (not glue pipelines)
     • Less glue: fewer sync / rebuild pipelines, fewer "who owns consistency?" debates
     • Better failure model: replay + rebuild can be reasoned about via log semantics + dataset versions
     • Faster iteration: model upgrades become policies (re-embed / re-index) instead of ad-hoc scripts
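To make the loop concrete, here is a purely conceptual sketch of the tiering step (this is not a real Fluss client API): consume ordered changes from the hot tier, re-embed them, and materialize a new Lance dataset version. `read_changes()` and `embed()` are hypothetical stand-ins.

```python
import lance
import pyarrow as pa

def read_changes(offset):
    """Stand-in for consuming ordered changes (Fluss or any log) from an offset."""
    raise NotImplementedError

def embed(texts):
    """Stand-in for the embedding model call."""
    raise NotImplementedError

def tier_once(offset: int, uri: str = "/tmp/docs.lance") -> int:
    batch = list(read_changes(offset))
    if not batch:
        return offset
    vectors = embed([c["body"] for c in batch])
    table = pa.table({
        "id": [c["id"] for c in batch],
        "body": [c["body"] for c in batch],
        "vector": vectors,
    })
    # Each append creates a new dataset version: the log offset plus the
    # dataset version gives a reproducible replay-and-rebuild path.
    lance.write_dataset(table, uri, mode="append")
    return offset + len(batch)
```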
  17. Landscape Summary — Patterns, Not Winners
     Patterns we saw:
     • Lifecycle gets productized: external sync → managed index → index-aware formats
     • Freshness shifts downward: pipeline concern → table concern → storage-tier concern
     • Systems move from "features" to "loops": embed → index → serve → monitor → re-embed/re-index becomes the new default
     How to use the patterns:
     • Don't ask "which one is best?" — ask where your pain sits (freshness, retrieval, or lifecycle cost)
     • Avoid extremes: more glue vs. a full rewrite
     • Choose an evolution path with clear trade-offs
     • Treat each coordinate as a capability package, not a brand
  18. From Points to a Path
     • Baseline: open table formats + batch-first analytics → ACID, time travel, governance (but scan-heavy)
     • AI-ready (v1): lakehouse + external vector DB → fastest adoption, but you own the lifecycle loop
     • AI-ready (v2): managed / integrated vector search → lifecycle is platformized (less glue, more coupling)
     • Streaming × AI-native direction: streaming-native ingestion + index-aware formats → freshness + retrieval + lifecycle become first-class design concerns
     Along the path:
     • Freshness becomes default: streaming moves closer to storage
     • Retrieval becomes index-aware: vector / hybrid access becomes normal
     • Lifecycle becomes managed: embed/index/refresh shifts from glue to platform
  19. Layered View — A Mental Model
     Layer 4 — AI Lifecycle (control loop):
     • Embed / re-embed policy (model & prompt upgrades)
     • Re-index / refresh orchestration
     • Feedback & eval (quality, drift, cost)
     Layer 3 — Indexing & Retrieval (serving path):
     • Vector / hybrid retrieval
     • Low-latency random access
     • Reproducibility for "what did the model see?"
     Layer 2 — Storage & Table Format (truth layer):
     • ACID + versions (time travel)
     • Upserts / compaction / clustering
     • Index-aware layout (when applicable)
     Layer 1 — Streaming & Change Propagation (freshness layer):
     • Continuous ingestion
     • Changelog as first-class signal
     • Event-time semantics & incremental consumption
  20. The Migration Path (High-Level)
     Step 0 — Baseline (Analytics-first):
     • Batch-first lakehouse tables
     • Scan-based reads, offline optimization
     • AI experiments stay "outside"
     Step 1 — AI-ready (v1 / v2):
     • External or integrated vector search
     • Embedding + sync pipelines appear
     • Lifecycle becomes a real cost center
     Step 2 — Streaming-aware:
     • Freshness becomes a default requirement
     • Upserts / changelog become first-class
     • Maintenance shifts toward automation
     Step 3 — AI-native (Loop-driven):
     • Index-aware storage & retrieval
     • Lifecycle managed as a platform loop
     • Fewer glue pipelines, more policies
  21. What Changes at Each Step (Trade-offs)
     • AI-ready (v1/v2): ✅ Fast to adopt ❌ Operational complexity grows
     • Streaming-aware: ✅ Better freshness & responsiveness ❌ Requires rethinking storage & maintenance
     • AI-native (loop-driven): ✅ Simplified lifecycle, better performance ❌ Ecosystem still evolving
  22. Practical Guidance — For Platform Teams
     Start with reality (not architecture):
     • Map where your pain actually is (freshness, latency, ops cost, reproducibility)
     • Identify who owns the index lifecycle today (scripts? teams? nobody?)
     • Measure how often data & models change
     • Accept current constraints (skills, budget, ecosystem)
     Move with direction (not perfection):
     • Reduce glue before adding features (fewer sync pipelines > faster vector search)
     • Make freshness & lifecycle explicit design goals
     • Evolve layer by layer, not by big rewrites
     • Choose components that align with your next step
  23. Practical Guidance — For Engineers & Builders
     Do (build durable skills):
     • Reason across layers: streaming → table → index → AI lifecycle
     • Make lifecycle explicit: embed / re-embed / refresh / rollback
     • Learn "why", not only "how": access pattern → storage layout → cost
     • Practice with real constraints: latency, freshness, ops, governance
     Don't (common traps):
     • Don't memorize tool stacks (they change faster than principles)
     • Don't treat the vector DB as a magic box (most issues are lifecycle)
     • Don't ignore updates/deletes (freshness & correctness will bite later)
     • Don't optimize one layer in isolation (you'll move the pain elsewhere)
  24. Bringing It All Together
     • AI changes the default requirements: random access, retrieval latency, and lifecycle become "always-on" concerns
     • The hard part is not vector search — it's the lifecycle loop: embed → index → serve → monitor → re-embed / re-index / rollback
     • There is no single winner — but there is a clear direction: streaming-first freshness + index-aware access + loop-driven operations
  25. Why We're Exploring This Together — Open Data Circle (ODC)
     What ODC is:
     • A community for builders of modern data & AI platforms
     • A place to share frameworks, trade-offs, and real lessons
     • A space to co-create: learn → test → refine → share
     What ODC is not:
     • Not a vendor roadmap
     • Not a "tool demo club"
     • Not a one-way broadcast
     What's next:
     • More meetups around Lakehouse × AI × Streaming
     • More community experiments (RAG demos, format deep dives, benchmarks)
     • More shared artifacts: slides, code, references
     How to join:
     • Bring a problem, a hypothesis, or a prototype
     • Ask hard questions, share honest constraints
     • Help us turn "trends" into usable knowledge