Building reliable AI systems in production

Anand Chowdhary

May 08, 2026

Transcript

  1. How we test & deploy prompts at FirstQuadrant Anand Chowdhary

    Co-founder & CTO of FirstQuadrant Inc.
  2. How we test & deploy prompts at FirstQuadrant Anand Chowdhary

    Co-founder & CTO of FirstQuadrant Inc. AGENTS
  3. How we test & deploy prompts at FirstQuadrant Anand Chowdhary

    Co-founder & CTO of FirstQuadrant Inc. AGENTS EXITED
  4. Building reliable AI systems in production

    From assistants to agents -> What a decade of building AI helpers taught me about productionizing agents
    Anand Chowdhary, Head of Product at Sycamore Labs, Inc.
  5. Timeline (2016–2026)

    2016: Oswald Labs · 2019: EIVA · 2022: FirstQuadrant
  9. Timeline (2016–2026)

    2016: Oswald Labs · 2019: EIVA · 2022: FirstQuadrant · 2026: Sycamore
  10. Timeline (2016–2026): what each company built

    - 2016: Oswald Labs, AI for accessibility
    - 2019: EIVA, AI scheduling assistant
    - 2022: FirstQuadrant, AI sales assistant
    - 2026: Sycamore, AI enterprise agents
  11. Timeline (2016–2026): the underlying technique

    - 2016: Oswald Labs, Prediction models
    - 2019: EIVA, NLP that parses emails
    - 2022: FirstQuadrant, Workflows and prompts
    - 2026: Sycamore, Agents that complete work
  12. Anand Chowdhary

    Founder · Y Combinator · Forbes 30 Under 30
    Engineer · GitHub Stars Award 2021–* · Open source contributor
    Investor · GigaCatalyst YC P26 * Airweave YC P25 $6M seed · Respan YC W24 $5M seed · CommandCode $5M seed · FirstQuadrant YC S21 $3M seed
  13. Prompt era: The old unit of deployment

    input → prompt → model → output
    + Versions, Logs, JSON schemas, Output evals
  14. Prompt era vs. Agent era

    Prompt era: The old unit of deployment
    input → prompt → model → output
    + Versions, Logs, JSON schemas, Output evals

    Agent era: The new unit of deployment
    goal → plan → tools → observations → state → action ← loop
    + Memory, Permissions, Approvals, Traces, Side effects
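The agent-era loop on this slide can be sketched as a small harness. This is only an illustration, not the speaker's implementation: `TOOLS`, `fake_model`, and `run_agent` are hypothetical names, and the "model" is a scripted stand-in so the loop runs without any API.

```python
from typing import Callable

# Hypothetical tool registry: name -> callable. Real tools would hit external systems.
TOOLS: dict[str, Callable[[str], str]] = {
    "lookup_customer": lambda arg: f"customer record for {arg}",
}

def fake_model(goal: str, observations: list[str]) -> dict:
    """Scripted stand-in for a model call: choose the next action from current state."""
    if not observations:
        return {"type": "tool", "name": "lookup_customer", "arg": "acme"}
    return {"type": "final", "answer": f"{goal}: done using {len(observations)} observation(s)"}

def run_agent(goal: str, max_steps: int = 5) -> dict:
    observations: list[str] = []  # state carried across loop iterations
    trace: list[dict] = []        # every chosen action is recorded for inspection
    for _ in range(max_steps):
        action = fake_model(goal, observations)
        trace.append(action)
        if action["type"] == "final":
            return {"answer": action["answer"], "trace": trace}
        result = TOOLS[action["name"]](action["arg"])  # act, then observe
        observations.append(result)
    return {"answer": None, "trace": trace}  # step budget exhausted

result = run_agent("prepare meeting brief")
```

The step budget is the one design choice worth noting: a loop that can act needs an explicit bound, unlike a single prompt call.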
  15. Prompts answer. Agents act.

    Bad prompt → bad answer. Bad agent → wrong email, wrong ticket, wrong source, wrong permission… in the real world. That changes the failure mode.
  16. Agents are models using tools in a loop

    –Hannah Moran, Applied AI at Anthropic, May 2025
    May 2026?
  17. Agents are models using tools in a loop

    –Hannah Moran, Applied AI at Anthropic, May 2025
    May 2026, me? Agent = Model + Harness “tools in a loop”
  18. Agents are models using tools in a loop

    –Hannah Moran, Applied AI at Anthropic, May 2025
    May 2026, me? Agent = Model + Harness “tools in a loop” + context + trust
  19. Context + Tools + Trust = Agent

    Context: what it knows. Tools: what it can do. Trust: why you let it act.
  21. Context is not a bigger prompt

    It is the agent’s working memory of the world. Context includes external systems, history, workspace, current task, and user preferences.
  22. Context is not a bigger prompt

    It is the agent’s working memory of the world. Context includes external systems, history, workspace, current task, and user preferences. Context has scope, freshness, sensitivity, and source.
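One way to make "scope, freshness, sensitivity, and source" concrete is to attach those four attributes to every piece of context and filter on them before building a prompt. The sketch below is an assumption about how that could look; `ContextItem`, `select_context`, and the sensitivity labels are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass(frozen=True)
class ContextItem:
    text: str
    scope: str            # e.g. "user", "org", "task"
    source: str           # where it came from, e.g. "crm", "email"
    sensitivity: str      # hypothetical labels: "public", "internal", "restricted"
    fetched_at: datetime  # used to judge freshness

def select_context(items, *, now, max_age, allowed_sensitivity):
    """Keep only items that are fresh enough and cleared for this task."""
    return [
        item for item in items
        if now - item.fetched_at <= max_age
        and item.sensitivity in allowed_sensitivity
    ]

now = datetime(2026, 5, 8, tzinfo=timezone.utc)
items = [
    ContextItem("Renewal due in June", "org", "crm", "internal", now - timedelta(hours=2)),
    ContextItem("Old pricing sheet", "org", "drive", "internal", now - timedelta(days=90)),
    ContextItem("Salary data", "org", "hr", "restricted", now - timedelta(hours=1)),
]
usable = select_context(
    items, now=now, max_age=timedelta(days=7),
    allowed_sensitivity={"public", "internal"},
)
```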
  23. Memory is context

    The field is splitting memory into explicit layers:
    1. Working memory
    2. Long-term facts
    3. Archival search
    4. Recurring notes
    5. Graph memory
  24. Memory should be transparent and inspectable

    OpenClaw treats memory less like hidden model state and more like a filesystem: durable facts, daily notes, searchable recall, and human-reviewable summaries.
    - MEMORY.md: durable long-term facts/preferences/decisions
    - memory/YYYY-MM-DD.md: daily running notes
    - DREAMS.md: reviewable summaries / background sweeps
    - Hybrid memory search: vector + keyword
    - Memory flush before compaction so important context doesn’t vanish
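The file layout named on this slide is easy to prototype with plain files. The sketch below reuses the `MEMORY.md` and `memory/YYYY-MM-DD.md` names from the slide, but the functions (`remember_fact`, `add_daily_note`, `keyword_search`) are hypothetical and are not OpenClaw's API. Only the keyword half of hybrid search is shown; a vector index is out of scope.

```python
import tempfile
from datetime import date
from pathlib import Path

def remember_fact(root: Path, fact: str) -> None:
    """Append a durable fact to MEMORY.md."""
    with (root / "MEMORY.md").open("a", encoding="utf-8") as f:
        f.write(f"- {fact}\n")

def add_daily_note(root: Path, day: date, note: str) -> Path:
    """Append a running note to memory/YYYY-MM-DD.md and return its path."""
    path = root / "memory" / f"{day.isoformat()}.md"
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as f:
        f.write(f"- {note}\n")
    return path

def keyword_search(root: Path, term: str) -> list[str]:
    """Case-insensitive substring search over every Markdown memory file."""
    hits = []
    for path in root.rglob("*.md"):
        for line in path.read_text(encoding="utf-8").splitlines():
            if term.lower() in line.lower():
                hits.append(f"{path.relative_to(root).as_posix()}: {line}")
    return hits

root = Path(tempfile.mkdtemp())  # throwaway directory for the example
remember_fact(root, "Customer prefers email over calls")
note_path = add_daily_note(root, date(2026, 5, 8), "Sent email recap to customer")
hits = keyword_search(root, "email")
```

Because everything is a text file, a human can open, diff, and correct the agent's memory with ordinary tools, which is the inspectability the slide argues for.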
  25. Memory can be namespaced

    LangGraph’s useful framing: memories are JSON documents stored under namespaces and keys.
    - scoped by user/org/app
    - queryable
    - optionally vector-searchable
    - backed by real stores like Postgres
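The namespace-and-key framing can be illustrated with a tiny in-memory store. This is a sketch of the idea only: `NamespacedMemory` is a hypothetical class, not LangGraph's store interface, and a production version would sit on a real database.

```python
import json

class NamespacedMemory:
    """In-memory stand-in for a namespaced JSON document store."""

    def __init__(self):
        self._docs = {}  # (namespace tuple, key) -> JSON string

    def put(self, namespace: tuple, key: str, value: dict) -> None:
        # Serializing on write enforces that memories are valid JSON documents.
        self._docs[(namespace, key)] = json.dumps(value)

    def get(self, namespace: tuple, key: str):
        raw = self._docs.get((namespace, key))
        return json.loads(raw) if raw is not None else None

    def search(self, prefix: tuple) -> list:
        """Return every document whose namespace starts with `prefix`."""
        return [
            json.loads(raw)
            for (ns, _key), raw in self._docs.items()
            if ns[:len(prefix)] == prefix
        ]

store = NamespacedMemory()
store.put(("org:acme", "user:1"), "prefs", {"tone": "formal"})
store.put(("org:acme", "user:2"), "prefs", {"tone": "casual"})
store.put(("org:other", "user:9"), "prefs", {"tone": "brief"})
acme_docs = store.search(("org:acme",))
```

Scoping falls out of the prefix match: a query for one org can never return another org's memories.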
  26. Memory can be a graph

    Zep is useful because it frames memory as a user-level knowledge graph built from chat and session history.
    - add memory every turn
    - retrieve context from any session belonging to the user
    - session is used to determine relevance, not to limit scope
    - assistant messages can contextualize memory without necessarily being ingested as user facts
    - Graph API exists when the high-level memory API is too opinionated
  27. Memory can be archival

    Letta & MemGPT’s useful distinction: always-visible memory vs. archival memory. Some memory belongs in the prompt, some belongs behind a search tool, and some should not be remembered at all.
    - long-term
    - semantically searchable
    - not always pinned into context
    - retrieved through tools when needed
    - good for documents, logs, customer history, support tickets, research notes
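The split between always-visible and archival memory can be sketched as two lists with different access paths. `AgentMemory` and its methods are hypothetical and are not Letta's or MemGPT's API; the word-overlap ranking is a toy substitute for semantic search.

```python
class AgentMemory:
    def __init__(self):
        self.pinned: list[str] = []   # always rendered into the prompt
        self.archive: list[str] = []  # only reachable through a search tool

    def build_prompt(self, task: str) -> str:
        pinned = "\n".join(f"- {m}" for m in self.pinned)
        return f"Known facts:\n{pinned}\n\nTask: {task}"

    def archival_search(self, query: str, limit: int = 3) -> list[str]:
        """Toy relevance: count shared lowercase words between query and document."""
        terms = set(query.lower().split())
        scored = [(len(terms & set(doc.lower().split())), doc) for doc in self.archive]
        ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: -s[0])
        return [doc for _, doc in ranked[:limit]]

memory = AgentMemory()
memory.pinned.append("User timezone is CET")
memory.archive.append("Ticket 1: refund issued for duplicate invoice")
memory.archive.append("Ticket 2: login failure after password reset")
prompt = memory.build_prompt("Reply about the invoice refund")
found = memory.archival_search("invoice refund")
```

The prompt stays small no matter how large the archive grows, because archived items only enter context when the agent explicitly searches for them.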
  28. Context + Tools + Trust = Agent

    Context: what it knows. Tools: what it can do. Trust: why you let it act.
  29. Tools need to be designed for non-deterministic users: agents

    Raw API surface: listRecords, getRecord, updateRecord, queryDatabase, sendEmail
    Agent-native surface: find_customer_context, prepare_reply_draft, request_send_approval, create_audited_ticket
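To illustrate the difference between the two surfaces, the sketch below wraps two raw calls into one task-shaped tool. The tool names `find_customer_context` and `request_send_approval` come from the slide; their bodies, the sample data, and the raw helpers `get_record` and `list_tickets` are assumptions made for the example.

```python
# Hypothetical in-memory data standing in for a CRM and a ticket system.
RECORDS = {
    "cust_1": {"name": "Acme", "owner": "dana", "plan": "pro"},
}
TICKETS = [
    {"customer": "cust_1", "status": "open", "title": "Billing mismatch"},
    {"customer": "cust_1", "status": "closed", "title": "SSO setup"},
]

def get_record(record_id: str) -> dict:            # raw API surface
    return RECORDS[record_id]

def list_tickets(customer_id: str) -> list[dict]:  # raw API surface
    return [t for t in TICKETS if t["customer"] == customer_id]

def find_customer_context(customer_id: str) -> dict:
    """Agent-native tool: one call, task-shaped output, no raw rows to sift through."""
    record = get_record(customer_id)
    open_tickets = [t["title"] for t in list_tickets(customer_id) if t["status"] == "open"]
    return {"customer": record["name"], "plan": record["plan"], "open_tickets": open_tickets}

def request_send_approval(draft: str) -> dict:
    """Agent-native tool: the agent can only ask to send, never send directly."""
    return {"status": "pending_approval", "draft": draft}

context = find_customer_context("cust_1")
approval = request_send_approval("Hi Acme, about your billing mismatch...")
```

The raw functions still exist, but only the two agent-native functions would be exposed to the model.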
  30. Writing effective tools

    Agents are only as good as their tools.
    - Tool design should be evaluation-driven
    - Test tools on realistic tasks
    - Inspect transcripts and failures
    - Improve tool names, schemas, descriptions, and workflows
  31. More tools can make the agent dumber

    If your agent has 800 tools, the first failure is menu design. The future is not one agent with every tool. It is agents with the right tool belt for the current job.
  32. Code execution

    At scale, tool calling becomes an information architecture problem. Direct MCP tool-calling doesn’t scale when you have hundreds/thousands of tools. The next generation of tool use may look less like function calling and more like giving agents a small programming environment.
  33. Code execution

    - Tool definitions overload context
    - Intermediate tool results bloat token usage
    - Code execution lets the agent inspect only what it needs
    - The agent can filter/process data locally
    - Anthropic’s own example reduced token usage from 150k to 2k tokens
    anandchowdhary.com/blog/2026/agentscript
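The "filter/process data locally" point can be shown with a toy comparison. In this sketch `fetch_rows` is a hypothetical tool and `agent_script` stands in for code the agent would write; payload size in characters is used as a crude proxy for tokens, and the figures have nothing to do with the 150k to 2k example on the slide.

```python
import json

def fetch_rows() -> list[dict]:
    """Hypothetical tool returning a large result set (1000 invoices)."""
    return [
        {"id": i, "status": "overdue" if i % 50 == 0 else "paid", "amount": i * 10}
        for i in range(1, 1001)
    ]

# Direct tool calling: the whole result is serialized into the model's context.
direct_payload = json.dumps(fetch_rows())

# Code execution: the agent runs a small script; only its return value reaches the model.
def agent_script() -> dict:
    overdue = [r for r in fetch_rows() if r["status"] == "overdue"]
    return {
        "overdue_count": len(overdue),
        "overdue_total": sum(r["amount"] for r in overdue),
    }

summary_payload = json.dumps(agent_script())
```

The intermediate rows never enter the model's context in the second path, which is where the savings come from.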
  34. Context + Tools + Trust = Agent

    Context: what it knows. Tools: what it can do. Trust: why you let it act.
  35. Trust is a feature, not a policy

    Permissions, Approvals, Audit trail, Rollback, Human correction, Evals
  36. Guardrails and human review

    - Block unsafe input before main agent starts
    - Validate/redact output before user sees it
    - Check tool arguments/results
    - Pause before side effects like cancellations, data edits, shell commands, or sensitive MCP actions
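The guardrail points listed above can be sketched as three small checks. Everything here is hypothetical: the tool names in `SIDE_EFFECT_TOOLS`, the one-phrase deny-list, and the email pattern are deliberately simplistic placeholders for real policies.

```python
import re

SIDE_EFFECT_TOOLS = {"cancel_subscription", "edit_record", "run_shell"}  # hypothetical names
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def check_input(text: str) -> bool:
    """Input guardrail: a toy deny-list; real checks would be far broader."""
    return "ignore previous instructions" not in text.lower()

def redact_output(text: str) -> str:
    """Output guardrail: mask email addresses before the user sees them."""
    return EMAIL_RE.sub("[redacted email]", text)

def gate_tool_call(name: str, args: dict, approved: bool = False) -> str:
    """Tool guardrail: pause side-effecting calls until a human approves."""
    if name in SIDE_EFFECT_TOOLS and not approved:
        return "paused_for_approval"
    return "allowed"

decision = gate_tool_call("run_shell", {"cmd": "rm -rf build"})
```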
  37. Durable workflow pattern

    Production autonomy is less about the model and more about the workflow runtime.
    1. LLM proposes an action
    2. Risky action pauses for human approval
    3. Workflow waits without consuming compute
    4. Approval comes through a signal
    5. Timeout survives disruptions
    6. Audit trail is preserved
    Survives restarts!
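The durable pattern above can be sketched without a workflow engine by keeping all state in a file. This is a toy under stated assumptions: `ApprovalWorkflow` is hypothetical, time is passed in explicitly so the example is deterministic, and a real deployment would rely on a proper durable workflow runtime rather than a JSON file.

```python
import json
import tempfile
from pathlib import Path

class ApprovalWorkflow:
    """Toy durable workflow: all state lives in a JSON file, not in process memory."""

    def __init__(self, path: Path):
        self.path = path

    def _load(self) -> dict:
        return json.loads(self.path.read_text())

    def _save(self, state: dict) -> None:
        self.path.write_text(json.dumps(state))

    def propose(self, action: str, now: float, timeout_s: float) -> None:
        """Steps 1-3: record the proposed action, then wait with nothing running."""
        self._save({
            "action": action, "status": "waiting", "deadline": now + timeout_s,
            "audit": [{"event": "proposed", "at": now}],
        })

    def signal(self, approved: bool, now: float) -> str:
        """Step 4: deliver a human decision; late signals resolve to 'timed_out'."""
        state = self._load()
        if state["status"] == "waiting":
            if now > state["deadline"]:
                state["status"] = "timed_out"
            else:
                state["status"] = "approved" if approved else "rejected"
            state["audit"].append({"event": state["status"], "at": now})
            self._save(state)
        return state["status"]

state_file = Path(tempfile.mkdtemp()) / "workflow.json"
ApprovalWorkflow(state_file).propose("send_email", now=0.0, timeout_s=3600.0)
# Simulate a restart: a brand-new object picks up the persisted state.
status = ApprovalWorkflow(state_file).signal(approved=True, now=120.0)
audit = json.loads(state_file.read_text())["audit"]
```

Because the deadline and audit trail are stored rather than held in memory, both survive the simulated restart.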
  38. Agents are infrastructure

    - OpenAI: AgentKit, Codex, Apps in ChatGPT
    - Microsoft: open agentic web, coding agents
    - Google: ADK, Agent2Agent, Agentspace
    - Respan: evals, observability, production agents
    Every platform now has agents. Reliability is the differentiator.
  39. Agent orchestration surfaces

    Agents are changing where human judgment sits.
    - Multi-agent coding workflows
    - Orchestration around Claude Code/Agents SDK
    - Context engineering as a first-class product concern
    - “Multi-Claude” / parallel coding sessions
    - Human control around hard problems in complex codebases
  40. Evals

    Final-answer eval: Looks correct
    Trajectory eval: Used stale source, skipped approval, ignored fresher context
    A right-looking answer can still come from the wrong process.
  41. Evals

    Don’t just evaluate the answer. Evaluate the work.
    Goal, plan, tool calls, observations, approvals, final action
    - Ask: right context?
    - Ask: right tool?
    - Ask: right sequence?
    - Ask: right approval?
    - Ask: right side effect?
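The difference between a final-answer eval and a trajectory eval can be sketched over a recorded run. The run format, the field names, and the 30-day staleness threshold below are all assumptions chosen for illustration.

```python
def final_answer_eval(run: dict, expected: str) -> bool:
    """Only checks the output, which is all a prompt-era eval can see."""
    return run["answer"] == expected

def trajectory_eval(run: dict, max_source_age_days: int = 30) -> list[str]:
    """Return the process failures found in a run's recorded steps."""
    failures = []
    steps = run["steps"]
    for i, step in enumerate(steps):
        if step["type"] == "tool" and step.get("source_age_days", 0) > max_source_age_days:
            failures.append(f"step {i}: used stale source")
        if step["type"] == "side_effect":
            approved_before = any(s["type"] == "approval" for s in steps[:i])
            if not approved_before:
                failures.append(f"step {i}: side effect without approval")
    return failures

run = {
    "answer": "Renewal is on June 1",
    "steps": [
        {"type": "tool", "name": "search_docs", "source_age_days": 400},
        {"type": "side_effect", "name": "update_crm"},
    ],
}
answer_ok = final_answer_eval(run, "Renewal is on June 1")
process_failures = trajectory_eval(run)
```

Here the answer check passes while the trajectory check reports two failures, which is the case the slides warn about.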
  42. Every production agent needs a flight recorder.

    Capture: goal, plan, context, tools, observations, approvals, state changes, final action, correction
    Incident → trace → label → regression eval → safer deployment
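A flight recorder can start as an append-only list of structured events. In this sketch `FlightRecorder` and `to_regression_case` are hypothetical names, and the regression case format is an assumption about how a labeled trace might feed an eval suite.

```python
import json

class FlightRecorder:
    def __init__(self):
        self.events: list[dict] = []

    def record(self, kind: str, **data) -> None:
        """Append one event with a sequence number so the run can be replayed in order."""
        self.events.append({"seq": len(self.events), "kind": kind, **data})

    def dump(self) -> str:
        """Serialize as JSON Lines so traces can be stored and diffed."""
        return "\n".join(json.dumps(e, sort_keys=True) for e in self.events)

def to_regression_case(events: list[dict], label: str) -> dict:
    """Incident -> trace -> label -> regression eval case."""
    goal = next(e["text"] for e in events if e["kind"] == "goal")
    return {"goal": goal, "label": label, "expected_kinds": [e["kind"] for e in events]}

rec = FlightRecorder()
rec.record("goal", text="Reschedule the demo")
rec.record("tool", name="find_calendar_slots")
rec.record("approval", granted=True)
rec.record("final_action", name="send_invite")
case = to_regression_case(rec.events, label="ok")
```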
  43. Is your agent a haunted Lambda?

    - Acts, but nobody knows why
    - Has logs, but not traces
    - Retries, but not intentionally
    - Fails after touching real systems
    - “But it worked yesterday”
    If you can’t replay it, you can’t improve it.
  44. A real workflow: Meeting prep

    1. Gather context
    2. Identify gaps
    3. Draft agenda
    4. Ask approval
    5. Update CRM / Linear / notes
    6. Summarize changes
  45. What we learned the hard way

    - The model is not the product.
    - Boundaries matter more than prompts.
    - Tool design is product design.
    - Evals need traces.
    - Trust is earned through control.
  46. Memory is becoming explicit files, stores, namespaces, graphs.

    Tools are becoming APIs for a new kind of user. Trust is becoming guardrails, approvals, audit trails.
  47. Autonomy is becoming durable workflows.

    Evals are moving from answer checks to trajectory checks. Agents are forcing us to build software engineering around model behavior.
  48. Everyone is independently rediscovering that agents are systems.

    The industry is converging on the same boring primitives. We are solving the same problems we solved in Linux decades ago.
  49. The good news: If agents are systems, we already know how to make them reliable.

    - Pick one real workflow to agentify
    - Log every step (context, tool calls, approvals, outputs) and replay failures as trajectories
    - Turn real traces into an eval harness
    - Increase autonomy one permission at a time
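"Increase autonomy one permission at a time" can be sketched as a policy that starts with nothing autonomous and grows by explicit grants. `PermissionPolicy` and the action names are hypothetical; how a team decides when to grant is left open.

```python
class PermissionPolicy:
    def __init__(self):
        self.autonomous: set[str] = set()  # actions the agent may take without review

    def grant(self, action: str) -> None:
        """Promote exactly one action to autonomous."""
        self.autonomous.add(action)

    def decide(self, action: str) -> str:
        # Default is human approval: anything not explicitly granted still pauses.
        return "auto" if action in self.autonomous else "needs_approval"

policy = PermissionPolicy()
policy.grant("draft_agenda")
decisions = {a: policy.decide(a) for a in ["draft_agenda", "update_crm"]}
```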
  50. Building reliable AI systems in production

    From assistants to agents -> What a decade of building AI helpers taught me about productionizing agents
    @AnandChowdhary · AnandChowdhary.com · Chowdhary.co