
Your AI Is Only As Good As Your Data Pipeline

Lee Wei
June 02, 2026


Transcript

  1. COMPUTEX · TAIPEI · 2026-06-02
    Your AI Is Only As Good As Your Data Pipeline.
    Wei Lee · Jason (Zhe-You) Liu
    APACHE AIRFLOW · PMC & COMMITTER / ASTRONOMER / OpenSource4You · 源來適你
    COMPUTEX 2026
  2. WHAT IS OPEN SOURCE
    © Lucasfilm Ltd. / The Walt Disney Company · Star Wars: The Force Awakens (2015)
    ASTRONOMER 04
  3. YOUR HOSTS
    Wei Lee · Jason Liu
    Wei Lee · APACHE AIRFLOW · PMC MEMBER: Maintainer of Apache Airflow. Open-source contributor at Astronomer.
    Jason Liu · APACHE AIRFLOW · COMMITTER: Apache Airflow committer. Open-source contributor at Astronomer.
  4. COMPUTEX 2026 · WHAT'S ON THE FLOOR
    The hardware here is absolutely insane.
    Vera Rubin NVL72 Blackwell Jetson Thor AMD MI350X Instinct EPYC Arm Neoverse AI Factory Physical AI Agentic AI HBM4 NVLink Gen 6 Liquid Cooling Rack-scale
    We have never had this much compute available.
  5. "Great! Now we can generate garbage faster."
    A faster GPU just gets you to the wrong answer sooner.
  6. A FAMILIAR EXAMPLE
    Building a data pipeline.
    01 Fetch (DOCS / PDFS) → 02 Chunk (SPLIT & CLEAN) → 03 Embed (LLM / API) → 04 Upsert (VECTOR DB) → 05 Run Agent (ANY PROVIDER)
    Looks simple. Until something breaks at 3 AM.
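
The first four boxes on this slide can be sketched as plain Python functions. This is a minimal illustration under loud assumptions, not the speakers' implementation: every function name is hypothetical, `embed` is a hash-based stand-in for a real embedding API, the "vector DB" is an in-memory dict, and stage 05 (Run Agent) is left out.

```python
import hashlib


def fetch(sources):
    """Stage 01: return raw documents (here, strings that are already loaded)."""
    return list(sources)


def chunk(docs, size=40):
    """Stage 02: normalize whitespace, then split into fixed-width chunks."""
    chunks = []
    for doc in docs:
        text = " ".join(doc.split())
        chunks.extend(text[i:i + size] for i in range(0, len(text), size))
    return chunks


def embed(chunks, dims=4):
    """Stage 03: toy stand-in for an embedding API (hash bytes scaled to 0..1)."""
    vectors = []
    for c in chunks:
        digest = hashlib.sha256(c.encode("utf-8")).digest()
        vectors.append([b / 255 for b in digest[:dims]])
    return vectors


def upsert(store, chunks, vectors):
    """Stage 04: write into a dict standing in for a vector DB.

    Keys are content hashes, so re-running the pipeline overwrites
    existing entries instead of duplicating them.
    """
    for c, v in zip(chunks, vectors):
        key = hashlib.sha256(c.encode("utf-8")).hexdigest()[:12]
        store[key] = {"text": c, "vector": v}
    return store


def run_pipeline(sources, store):
    docs = fetch(sources)
    chunks = chunk(docs)
    vectors = embed(chunks)
    return upsert(store, chunks, vectors)


store = run_pipeline(["Apache  Airflow schedules and monitors workflows."], {})
```

Keying the upsert on a content hash is one way to make a re-run safe, which matters once steps start being retried.
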
  7. WHAT BREAKS
    Every step has a failure mode.
    Fetch: source moved, API expired, 429.
    Chunk: bad encoding, wrong format, OOM.
    Embed: rate limit, model changed, cost spike.
    Upsert: network, index full, schema mismatch.
    Run Agent: wrong tools, context overflow, hallucinated calls.
  8. YOU. AT 3 AM.
    Production. Everything's on fire. Without a pipeline, this is your only option.
    "This Is Fine" by KC Green · Gunshow (2013) · gunshowcomic.com
  9. THE THREE THINGS A PIPELINE NEEDS
    Every pipeline must be...
    01 Reliable: It runs. Every time. Even when things break upstream.
    02 Observable: You can see what's happening. What failed. And why.
    03 Retryable: Failures are normal. You can pick up where it broke.
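
"Pick up where it broke" can be illustrated with a small checkpointing runner. This is a sketch under assumptions: `run_steps`, the in-memory `state` dict, and the deliberately flaky `embed` step are all hypothetical, and a real orchestrator would persist state outside the process.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")


def run_steps(steps, state):
    """Run (name, func) steps in order, skipping steps already recorded in state.

    `state` maps step name -> result; each step receives the previous result.
    """
    data = None
    for name, func in steps:
        if name in state:
            log.info("skip %s (already done)", name)
            data = state[name]
            continue
        try:
            data = func(data)
        except Exception:
            log.exception("step %s failed", name)  # observable: what failed, and why
            raise
        state[name] = data
        log.info("done %s", name)
    return data


calls = {"fetch": 0, "embed": 0}


def fetch(_):
    calls["fetch"] += 1
    return ["doc one", "doc two"]


def embed(docs):
    calls["embed"] += 1
    if calls["embed"] == 1:
        raise TimeoutError("embedding API timed out")  # simulated first-run failure
    return [len(d) for d in docs]


steps = [("fetch", fetch), ("embed", embed)]
state = {}
try:
    run_steps(steps, state)
except TimeoutError:
    pass
result = run_steps(steps, state)  # second run resumes at "embed"
```

On the second run `fetch` is not called again; only the step that failed is re-executed.
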
  10. WHAT IS AI, REALLY?
    Strip AI down to its plumbing.
    TRAINING: Data → Model
    INFERENCE: Data → Model → Decision
    Take the data away. Nothing left.
    GPT-4 was trained on ~13 trillion tokens. Every chat prompt is data. Every RAG retrieval is data. Take it all out, and you're left with an empty matrix of weights. A fancy zero.
  11. WHAT IS CONTEXT, ACTUALLY?
    Context = a fancy word for data.
    Retrieved chunks → data from your vector DB
    Conversation history → data from the chat so far
    Tool / function outputs → data your agent just looked up
    User memory → data about who you're talking to
    System prompt → data you wrote into the app
    All data. Just renamed.
  12. WHAT IF YOUR AI...
    "I'm not afraid of forgetting."
    © Bushiroad / BanG Dream! Project · BanG Dream! Ave Mujica
  13. But your boss does.
    When the AI gives a wrong answer to a real customer.
  14. WRONG CONTEXT CAN COST YOU
    It already has.
    BBC: "Airline held liable for its chatbot giving passenger bad advice - what this means for travellers" · 23 February 2024 · Maria Yagoda, Features correspondent
    Headline reproduction · BBC News, 23 Feb 2024
  15. MOFFATT V. AIR CANADA · 2024
    The chatbot made up a refund that didn't exist.
    BC CIVIL RESOLUTION TRIBUNAL · CANADA
    ~$650 CAD partial refund + tribunal fees, ordered by court
    "The chatbot is a separate legal entity, responsible for its own actions." (Air Canada's actual defense in court)
    Tribunal member Christopher Rivers: "This is a remarkable submission."
    Source: BC CRT ruling, 14 Feb 2024 · BBC News · CBC · ABA
  16. WHEN THE BILL ARRIVES...
    "Our income is really too low..."
    © Bushiroad / BanG Dream! Project · BanG Dream! Ave Mujica
  17. CALL-BACK
    Manage context the way we manage data.
    01 Pull (FROM YOUR SOURCES) → 02 Filter (WHAT'S RELEVANT) → 03 Pack (INTO THE PROMPT) → 04 Ask (THE AI) → 05 Act (TOOLS / SIDE EFFECTS)
  18. REMEMBER THIS?
    The old data pipeline.
    01 Fetch (DOCS / PDFS) → 02 Chunk (SPLIT & CLEAN) → 03 Embed (LLM / API) → 04 Upsert (VECTOR DB) → 05 Run Agent (ANY PROVIDER)
    Identical to the AI context pipeline. Box by box. Arrow by arrow.
  19. PLUS A NEW CONSTRAINT
    Context window is limited.
    You can't stuff everything in. Bigger window = slower & more expensive. Picking what to send is the real work.
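
"Picking what to send" can be sketched as a greedy selection under a token budget. Everything here is an assumption for illustration: `pack_context` is a hypothetical helper, the relevance scores and chunk texts are made up, and counting whitespace-separated words only approximates what a real tokenizer would report.

```python
def pack_context(chunks, budget, count_tokens=lambda text: len(text.split())):
    """Greedily keep the highest-scoring chunks that fit within `budget` tokens.

    `chunks` is a list of (score, text) pairs; a higher score means more relevant.
    Returns the selected texts and the number of tokens they use.
    """
    selected, used = [], 0
    for score, text in sorted(chunks, key=lambda c: c[0], reverse=True):
        cost = count_tokens(text)
        if used + cost <= budget:
            selected.append(text)
            used += cost
    return selected, used


# Made-up chunks with made-up relevance scores.
chunks = [
    (0.91, "Refunds must be requested before the travel date."),
    (0.40, "The lounge serves breakfast from six in the morning."),
    (0.75, "Refund requests are processed within thirty days."),
]
selected, used = pack_context(chunks, budget=16)
```

With a budget of 16, the two refund chunks fit (8 + 7 tokens) and the lower-scoring lounge chunk is dropped. Greedy selection is the simplest policy, not necessarily the best one; it trades optimality for being easy to reason about.
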
  20. Reliable.
    Your context must arrive. Every time. Every wrong context becomes a wrong answer to a real customer.
  21. Observable.
    When the AI gives a bad answer, you must know why. You can't fix what you can't see. And AI failures are silent.
  22. Retryable.
    He who has never seen a network timeout, cast the first stone. Rate limits. Model changes. Cost spikes. The failures are bigger now.
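
A common way to handle transient failures like timeouts and rate limits is to retry with exponential backoff. The helper below is a minimal sketch, not taken from the talk: `retry`, its parameters, and the simulated `flaky_call` are hypothetical, and the `sleep` argument is injectable only so the example runs without actually waiting.

```python
import time


def retry(func, attempts=4, base_delay=0.5,
          retry_on=(TimeoutError, ConnectionError), sleep=time.sleep):
    """Call func(); on a retryable error wait base_delay * 2**attempt, then retry."""
    for attempt in range(attempts):
        try:
            return func()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)


# Simulated call that fails twice, then succeeds.
outcomes = iter([TimeoutError("timeout"), ConnectionError("reset"), "ok"])


def flaky_call():
    outcome = next(outcomes)
    if isinstance(outcome, Exception):
        raise outcome
    return outcome


delays = []
result = retry(flaky_call, sleep=delays.append)  # record delays instead of sleeping
```

Errors outside `retry_on` are not caught, so a genuine bug such as a `ValueError` fails immediately instead of being retried.
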
  23. Born in 2015. Still shipping in 2026.
    A stand from ten years ago. And there's no such thing as a weak stand.
    Apache Airflow® · The Apache Software Foundation · airflow.apache.org
  24. FROM ONE TASK TO THOUSANDS
    It runs your whole graph.
    Apache Airflow® Graph view · airflow.apache.org
  25. SO HOW DO YOU USE IT?
    Jason will show you.
    UP NEXT: Jason Liu · APACHE AIRFLOW · COMMITTER → TOPIC: Why Airflow, specifically.
  26. From static config to runtime decision.
    Read the trace. Pick the strategy. Diagnose inline. Page with context.
  27. Many homes. One best fit.
    Run by the team that ships the next release. Zero to production, today. Lineage, alerts, catalogue: built in. SSO. RBAC. Audit. Day one.
  28. By the committers who ship the next release.
    CVE patches + multi-zone resilience. Local dev to cloud, one command.