Building reliable AI systems in production

How we test & deploy prompts at FirstQuadrant Anand Chowdhary
Co-founder & CTO of FirstQuadrant Inc. AGENTS EXITED

Building reliable AI systems in production
From assistants to agents -> What a decade of building AI helpers taught me about productionizing agents
Anand Chowdhary
Head of Product at Sycamore Labs, Inc.

2016 · Oswald Labs · AI for accessibility · Prediction models
2019 · EIVA · AI scheduling assistant · NLP that parses emails
2022 · FirstQuadrant · AI sales assistant · Workflows and prompts
2026 · Sycamore · AI enterprise agents · Agents that complete work

Anand Chowdhary
Founder · Y Combinator · Forbes 30 Under 30
Engineer · GitHub Stars Award 2021–* · Open source contributor
Investor · GigaCatalyst YC P26 * Airweave YC P25 $6M seed Respan YC W24 $5M seed CommandCode $5M seed FirstQuadrant YC S21 $3M seed

Dumb software is over. People need software that runs itself.

Prompt era: the old unit of deployment
input → prompt → model → output
+ Versions, Logs, JSON schemas, Output evals

Agent era: the new unit of deployment
goal → plan → tools → observations → state → action ← loop
+ Memory, Permissions, Approvals, Traces, Side effects

Prompts answer. Agents act.
Bad prompt → bad answer
Bad agent → wrong email, wrong ticket, wrong source, wrong permission… in the real world.
That changes the failure mode.

“Agents are models using tools in a loop” –Hannah Moran, Applied AI at Anthropic, May 2025
May 2026, me?
Agent = Model + Harness
“tools in a loop” + context + trust
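
A minimal sketch of "models using tools in a loop", under loud assumptions: `scripted_model` is a hypothetical stand-in for a real model call, `lookup` is a made-up tool, and the step budget is the only piece of harness shown. It is not any vendor's agent API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Step:
    tool: Optional[str]  # name of the tool to call, or None when finished
    argument: str = ""
    answer: str = ""


def scripted_model(goal: str, observations: List[str]) -> Step:
    """Hypothetical stand-in for a model call: look up once, then answer."""
    if not observations:
        return Step(tool="lookup", argument=goal)
    return Step(tool=None, answer=f"Done: {observations[-1]}")


def run_agent(goal, model, tools, max_steps=5) -> str:
    """The loop: ask the model, run the chosen tool, feed the result back."""
    observations: List[str] = []
    for _ in range(max_steps):
        step = model(goal, observations)
        if step.tool is None:
            return step.answer
        observations.append(tools[step.tool](step.argument))
    raise RuntimeError("step budget exhausted before the agent finished")


# Hypothetical tool registry with a single read-only tool.
TOOLS = {"lookup": lambda query: f"found notes for '{query}'"}
result = run_agent("prepare meeting", scripted_model, TOOLS)
print(result)
```

Context and trust are what the harness adds around this loop: what goes into `observations`, and which entries in `TOOLS` are allowed to run without a human.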

Empathy for Claude My Claude Code moment

Context + Tools + Trust = Agent
- Context: what it knows
- Tools: what it can do
- Trust: why you let it act

Context is not a bigger prompt
It is the agent’s working memory of the world.
Context includes external systems, history, workspace, current task, and user preferences.
Context has scope, freshness, sensitivity, and source.
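
One way to make scope, freshness, sensitivity, and source concrete is to attach them to every piece of context and filter before anything reaches the model. This is a sketch only: the field names, the three sensitivity labels, and the `usable_context` helper are assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class ContextItem:
    text: str
    scope: str          # assumed values: "user", "org", "task"
    source: str         # where the item came from
    sensitivity: str    # assumed values: "public", "internal", "secret"
    fetched_at: datetime


def usable_context(items, now, max_age, allowed_scopes, allow_secret=False):
    """Keep only items that are in scope, fresh enough, and safe to show."""
    kept = []
    for item in items:
        if item.scope not in allowed_scopes:
            continue
        if now - item.fetched_at > max_age:
            continue
        if item.sensitivity == "secret" and not allow_secret:
            continue
        kept.append(item)
    return kept


now = datetime(2026, 5, 1, 12, 0, tzinfo=timezone.utc)
items = [
    ContextItem("Prefers short agendas", "user", "notes", "internal", now - timedelta(hours=1)),
    ContextItem("Old pricing sheet", "org", "drive", "internal", now - timedelta(days=3)),
    ContextItem("API key for billing", "org", "vault", "secret", now - timedelta(hours=1)),
]
kept = usable_context(items, now, timedelta(days=1), {"user", "org"})
print([item.text for item in kept])
```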

Memory is context
The field is splitting memory into explicit layers:
1. Working memory
2. Long-term facts
3. Archival search
4. Recurring notes
5. Graph memory

Memory should be transparent and inspectable
OpenClaw treats memory less like hidden model state and more like a filesystem: durable facts, daily notes, searchable recall, and human-reviewable summaries.
- MEMORY.md: durable long-term facts/preferences/decisions
- memory/YYYY-MM-DD.md: daily running notes
- DREAMS.md: reviewable summaries / background sweeps
- Hybrid memory search: vector + keyword
- Memory flush before compaction so important context doesn’t vanish
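
The filesystem framing can be sketched in a few lines. The file names come from the slide; everything else (the helper names, the bullet format, flushing a list of pending notes) is an assumption for illustration and not OpenClaw's actual implementation.

```python
import tempfile
from datetime import date
from pathlib import Path


def append_line(path: Path, line: str) -> None:
    """Append one human-readable bullet, creating parent folders as needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as handle:
        handle.write(f"- {line}\n")


def remember_fact(root: Path, fact: str) -> None:
    append_line(root / "MEMORY.md", fact)


def add_daily_note(root: Path, day: date, note: str) -> None:
    append_line(root / "memory" / f"{day.isoformat()}.md", note)


def flush_before_compaction(root: Path, day: date, pending: list) -> int:
    """Write notes still held in working memory to disk before context is trimmed."""
    for note in pending:
        add_daily_note(root, day, note)
    count = len(pending)
    pending.clear()
    return count


root = Path(tempfile.mkdtemp())
remember_fact(root, "Prefers morning meetings")
pending = ["Asked to reschedule the Acme call", "Draft agenda approved"]
flushed = flush_before_compaction(root, date(2026, 5, 1), pending)
daily_text = (root / "memory" / "2026-05-01.md").read_text(encoding="utf-8")
print(flushed, daily_text)
```

Because the memory is plain text on disk, a human can open, review, and correct it with any editor.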

Memory can be namespaced
LangGraph’s useful framing: memories are JSON documents stored under namespaces and keys.
- scoped by user/org/app
- queryable
- optionally vector-searchable
- backed by real stores like Postgres
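
To illustrate the namespace-and-key framing without depending on any library, here is a hypothetical in-memory store. The `NamespacedMemory` class and its methods are invented for this sketch and are not LangGraph's API; a production store would sit on something like Postgres.

```python
class NamespacedMemory:
    """Toy store: JSON-like documents addressed by (namespace tuple, key)."""

    def __init__(self):
        # Maps namespace tuple -> {key -> document dict}.
        self._docs = {}

    def put(self, namespace, key, value):
        self._docs.setdefault(namespace, {})[key] = value

    def get(self, namespace, key):
        return self._docs.get(namespace, {}).get(key)

    def search(self, prefix, **filters):
        """Return documents under a namespace prefix whose fields match the filters."""
        hits = []
        for namespace, docs in self._docs.items():
            if namespace[: len(prefix)] != prefix:
                continue
            for doc in docs.values():
                if all(doc.get(field) == wanted for field, wanted in filters.items()):
                    hits.append(doc)
        return hits


store = NamespacedMemory()
store.put(("org-1", "user-7"), "tone", {"kind": "preference", "text": "Keep emails short"})
store.put(("org-1", "user-9"), "tone", {"kind": "preference", "text": "Formal tone"})
store.put(("org-2", "user-3"), "plan", {"kind": "fact", "text": "On the trial plan"})

org_prefs = store.search(("org-1",), kind="preference")
print([doc["text"] for doc in org_prefs])
```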

Memory can be a graph
Zep is useful because it frames memory as a user-level knowledge graph built from chat and session history.
- add memory every turn
- retrieve context from any session belonging to the user
- session is used to determine relevance, not to limit scope
- assistant messages can contextualize memory without necessarily being ingested as user facts
- Graph API exists when the high-level memory API is too opinionated

Memory can be archival
Letta & MemGPT’s useful distinction: always-visible memory vs. archival memory. Some memory belongs in the prompt, some belongs behind a search tool, and some should not be remembered at all.
- long-term
- semantically searchable
- not always pinned into context
- retrieved through tools when needed
- good for documents, logs, customer history, support tickets, research notes
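
The pinned-versus-archival split might look like the sketch below. It is an assumption-heavy illustration: `AgentMemory` is a made-up class, and keyword overlap stands in for the semantic search a real archival store would use. It is not Letta's or MemGPT's API.

```python
class AgentMemory:
    def __init__(self, pinned, archive):
        self.pinned = list(pinned)    # always rendered into the prompt
        self.archive = list(archive)  # only reachable through the search tool

    def prompt_block(self) -> str:
        """Text that is pinned into every model call."""
        return "\n".join(f"- {fact}" for fact in self.pinned)

    def search_archive(self, query: str, limit: int = 2):
        """Tool the agent calls on demand; keyword overlap replaces real embeddings."""
        wanted = set(query.lower().split())
        scored = []
        for doc in self.archive:
            score = len(wanted & set(doc.lower().split()))
            if score > 0:
                scored.append((score, doc))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [doc for _, doc in scored[:limit]]


memory = AgentMemory(
    pinned=["Customer is on the enterprise plan"],
    archive=[
        "Ticket 1042: refund issued for duplicate charge",
        "Research notes on pricing tiers",
        "Ticket 1077: login failure after password reset",
    ],
)
hits = memory.search_archive("refund ticket")
print(memory.prompt_block())
print(hits)
```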

Tools need to be designed for non-deterministic users: agents
Raw API surface: listRecords, getRecord, updateRecord, queryDatabase, sendEmail
Agent-native surface: find_customer_context, prepare_reply_draft, request_send_approval, create_audited_ticket
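
A sketch of the difference, using two of the tool names from the slide. The bodies are hypothetical: `send_email` stands in for the raw API, and the agent is only ever handed the draft and approval tools, so the side effect cannot happen without a human decision.

```python
from dataclasses import dataclass

SENT = []  # stands in for the outside world touched by the raw API


def send_email(to: str, body: str) -> None:
    """Raw API call (hypothetical). Not exposed to the agent directly."""
    SENT.append((to, body))


@dataclass
class Draft:
    to: str
    body: str
    approved: bool = False


def prepare_reply_draft(customer_email: str, summary: str) -> Draft:
    """Agent-native tool: produces a reviewable artifact, no side effects."""
    return Draft(to=customer_email, body=f"Hi, following up: {summary}")


def request_send_approval(draft: Draft, approver_says_yes: bool) -> str:
    """Agent-native tool: the send only happens behind an explicit approval."""
    if not approver_says_yes:
        return "rejected: nothing sent"
    draft.approved = True
    send_email(draft.to, draft.body)
    return "sent after approval"


draft = prepare_reply_draft("jane@example.com", "your refund was processed")
first = request_send_approval(draft, approver_says_yes=False)
second = request_send_approval(draft, approver_says_yes=True)
print(first, "|", second, "|", len(SENT))
```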

Writing effective tools
Agents are only as good as their tools
- Tool design should be evaluation-driven
- Test tools on realistic tasks
- Inspect transcripts and failures
- Improve tool names, schemas, descriptions, and workflows

More tools can make the agent dumber
If your agent has 800 tools, the first failure is menu design.
The future is not one agent with every tool. It is agents with the right tool belt for the current job.
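
One simple reading of "the right tool belt for the current job" is to tag tools by job and expose only the matching subset. The catalog, the job tags, and the `tool_belt` helper below are all hypothetical; real systems might select tools by retrieval or by role instead.

```python
# Hypothetical catalog: tool name -> jobs where that tool is relevant.
TOOL_CATALOG = {
    "find_customer_context": {"support", "sales"},
    "prepare_reply_draft": {"support", "sales"},
    "create_audited_ticket": {"support"},
    "run_shell_command": {"coding"},
}


def tool_belt(job: str, catalog: dict, limit: int = 5) -> list:
    """Return at most `limit` tool names relevant to the job, in a stable order."""
    relevant = sorted(name for name, jobs in catalog.items() if job in jobs)
    return relevant[:limit]


support_belt = tool_belt("support", TOOL_CATALOG)
coding_belt = tool_belt("coding", TOOL_CATALOG)
print(support_belt, coding_belt)
```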

Code execution
At scale, tool calling becomes an information architecture problem.
Direct MCP tool-calling doesn’t scale when you have hundreds or thousands of tools.
The next generation of tool use may look less like function calling and more like giving agents a small programming environment.

Code execution
- Tool definitions overload context
- Intermediate tool results bloat token usage
- Code execution lets the agent inspect only what it needs
- The agent can filter/process data locally
- Anthropic’s own example reduced token usage from 150k to 2k tokens
anandchowdhary.com/blog/2026/agentscript
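
The effect can be illustrated with a toy comparison. Everything here is assumed for the sketch: `fetch_orders` is a made-up tool returning synthetic rows, and payload length in characters is only a crude proxy for tokens. It does not reproduce Anthropic's measurement.

```python
import json


def fetch_orders():
    """Hypothetical tool result: 1,000 synthetic order rows."""
    return [
        {"id": i, "status": "late" if i % 100 == 0 else "ok", "total": i * 10}
        for i in range(1000)
    ]


# Direct tool calling: the whole intermediate result lands in model context.
direct_payload = json.dumps(fetch_orders())


def agent_snippet() -> str:
    """Code execution: filter locally, return only what the model needs."""
    late_ids = [order["id"] for order in fetch_orders() if order["status"] == "late"]
    return json.dumps({"late_order_ids": late_ids})


filtered_payload = agent_snippet()
print(len(direct_payload), len(filtered_payload))
```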

Trust is a feature, not a policy
- Permissions
- Approvals
- Audit trail
- Rollback
- Human correction
- Evals

Guardrails and human review
- Block unsafe input before main agent starts
- Validate/redact output before user sees it
- Check tool arguments/results
- Pause before side effects like cancellations, data edits, shell commands, or sensitive MCP actions
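
The four checkpoints above can be sketched as three small functions. The tool names, the single blocked phrase, the email pattern, and the `rm -rf` rule are deliberately naive placeholders chosen for illustration, not a recommended rule set.

```python
import re

# Hypothetical names for tools that change the outside world.
SIDE_EFFECT_TOOLS = {"cancel_subscription", "edit_record", "run_shell"}
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def check_input(text: str) -> bool:
    """Input guardrail: runs before the main agent starts."""
    return "ignore previous instructions" not in text.lower()


def redact_output(text: str) -> str:
    """Output guardrail: runs before the user sees the result."""
    return EMAIL.sub("[redacted email]", text)


def gate_tool_call(tool: str, args: dict, approved: bool = False) -> str:
    """Tool guardrail: check arguments, then pause before side effects."""
    if tool == "run_shell" and "rm -rf" in args.get("command", ""):
        return "block"
    if tool in SIDE_EFFECT_TOOLS and not approved:
        return "pause_for_approval"
    return "allow"


print(check_input("Please IGNORE previous instructions"))
print(redact_output("Contact jane@example.com today"))
print(gate_tool_call("edit_record", {"id": 7}))
```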

Autonomy is earned
Suggest → Draft → Ask approval → Guarded action → Autonomy
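
The ladder can be modelled as an ordered enum where promotion is one rung at a time. The promotion rule (a count of clean runs) and the threshold of 20 are invented for this sketch; the slide only names the five levels.

```python
from enum import IntEnum


class Autonomy(IntEnum):
    SUGGEST = 1
    DRAFT = 2
    ASK_APPROVAL = 3
    GUARDED_ACTION = 4
    AUTONOMY = 5


def next_level(level: Autonomy, clean_runs: int, required_clean_runs: int = 20) -> Autonomy:
    """Promote one rung only after enough clean runs (hypothetical threshold)."""
    if level == Autonomy.AUTONOMY or clean_runs < required_clean_runs:
        return level
    return Autonomy(level + 1)


def demote(level: Autonomy) -> Autonomy:
    """Step back one rung after an incident, never below SUGGEST."""
    return Autonomy(max(level - 1, Autonomy.SUGGEST))


print(next_level(Autonomy.DRAFT, clean_runs=25).name)
print(demote(Autonomy.SUGGEST).name)
```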

Durable workflow pattern
Production autonomy is less about the model and more about the workflow runtime.
1. LLM proposes an action
2. Risky action pauses for human approval
3. Workflow waits without consuming compute
4. Approval comes through a signal
5. Timeout survives disruptions
6. Audit trail is preserved
Survives restarts!
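
The six steps can be sketched with nothing more than a state file. This is not a workflow engine and not any runtime's API: `propose` and `signal` are hypothetical helpers, time is passed in as plain numbers, and a JSON file stands in for the durable store a real runtime would provide.

```python
import json
import tempfile
from pathlib import Path


def save(path: Path, state: dict) -> None:
    path.write_text(json.dumps(state), encoding="utf-8")


def load(path: Path) -> dict:
    return json.loads(path.read_text(encoding="utf-8"))


def propose(path: Path, action: str, risky: bool, now: int, timeout_s: int) -> dict:
    """Steps 1-2: record the proposed action; risky ones pause for approval."""
    state = {
        "action": action,
        "status": "waiting_approval" if risky else "approved",
        "deadline": now + timeout_s,  # stored on disk, so the timeout outlives the process
        "audit": [f"{now}: proposed {action}"],
    }
    save(path, state)
    return state


def signal(path: Path, approved: bool, now: int) -> dict:
    """Steps 4-6: an approval signal resumes the workflow from persisted state."""
    state = load(path)
    if state["status"] != "waiting_approval":
        return state
    if now > state["deadline"]:
        state["status"] = "timed_out"
    else:
        state["status"] = "approved" if approved else "rejected"
    state["audit"].append(f"{now}: {state['status']}")
    save(path, state)
    return state


path = Path(tempfile.mkdtemp()) / "workflow.json"
propose(path, "cancel_order", risky=True, now=0, timeout_s=60)
# Step 3: nothing runs while waiting. The process could restart here and
# only the state file would be needed to continue.
final = signal(path, approved=True, now=30)
print(final["status"], final["audit"])
```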

Agents are infrastructure
- OpenAI: AgentKit, Codex, Apps in ChatGPT
- Microsoft: open agentic web, coding agents
- Google: ADK, Agent2Agent, Agentspace
- Respan: evals, observability, production agents
Every platform now has agents. Reliability is the differentiator.

Agent orchestration surfaces
Agents are changing where human judgment sits.
- Multi-agent coding workflows
- Orchestration around Claude Code/Agents SDK
- Context engineering as a first-class product concern
- “Multi-Claude” / parallel coding sessions
- Human control around hard problems in complex codebases

Evals
Final-answer eval: Looks correct
Trajectory eval: Used stale source, skipped approval, ignored fresher context
A right-looking answer can still come from the wrong process.

Evals
Don’t just evaluate the answer. Evaluate the work.
Goal, plan, tool calls, observations, approvals, final action
- Ask: right context?
- Ask: right tool?
- Ask: right sequence?
- Ask: right approval?
- Ask: right side effect?
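
A trajectory eval can start as a plain function over the recorded steps. The step schema (`type`, `stale`, `granted`) and the two checks are assumptions for this sketch; a real harness would cover the other questions as well.

```python
def evaluate_trajectory(steps: list) -> list:
    """Return process findings, independent of how good the final answer looks."""
    findings = []
    approved = False
    for step in steps:
        if step["type"] == "observation" and step.get("stale"):
            findings.append("used stale source")
        if step["type"] == "approval":
            approved = step.get("granted", False)
        if step["type"] == "side_effect" and not approved:
            findings.append("side effect without approval")
    return findings


bad_run = [
    {"type": "observation", "source": "crm_export", "stale": True},
    {"type": "side_effect", "tool": "update_crm"},
    {"type": "final_answer", "text": "CRM updated."},  # looks correct
]
good_run = [
    {"type": "observation", "source": "crm_live", "stale": False},
    {"type": "approval", "granted": True},
    {"type": "side_effect", "tool": "update_crm"},
    {"type": "final_answer", "text": "CRM updated."},
]
print(evaluate_trajectory(bad_run))
print(evaluate_trajectory(good_run))
```

Both runs end with the same final answer, which is why a final-answer eval alone would pass them both.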

Every production agent needs a flight recorder.
Capture: goal, plan, context, tools, observations, approvals, state changes, final action, correction
Incident → trace → label → regression eval → safer deployment
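
A flight recorder can be as small as an append-only JSON Lines log plus a way to turn one run into a regression case. The event fields, the `FlightRecorder` class, and the shape of the regression case are all hypothetical choices for this sketch.

```python
import io
import json


class FlightRecorder:
    """Append-only event log: one JSON object per line."""

    def __init__(self, sink):
        self.sink = sink
        self.seq = 0

    def record(self, run_id: str, kind: str, **data) -> None:
        self.seq += 1
        event = {"run_id": run_id, "seq": self.seq, "kind": kind, **data}
        self.sink.write(json.dumps(event) + "\n")


def load_trace(text: str, run_id: str) -> list:
    """Incident → trace: pull one run's events back out, in order."""
    events = [json.loads(line) for line in text.splitlines() if line.strip()]
    return sorted((e for e in events if e["run_id"] == run_id), key=lambda e: e["seq"])


def to_regression_case(trace: list, label: str) -> dict:
    """Trace → label → regression eval: freeze the run as a test case."""
    return {
        "label": label,
        "goal": trace[0].get("goal"),
        "expected_kinds": [event["kind"] for event in trace],
    }


sink = io.StringIO()
recorder = FlightRecorder(sink)
recorder.record("run-1", "goal", goal="Prepare Acme meeting")
recorder.record("run-1", "tool_call", tool="find_customer_context")
recorder.record("run-2", "goal", goal="Unrelated run")
recorder.record("run-1", "final_action", action="agenda drafted")

trace = load_trace(sink.getvalue(), "run-1")
case = to_regression_case(trace, label="skipped approval")
print(case)
```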

Is your agent a haunted Lambda?
- Acts, but nobody knows why
- Has logs, but not traces
- Retries, but not intentionally
- Fails after touching real systems
- “But it worked yesterday”
If you can’t replay it, you can’t improve it.

A real workflow: Meeting prep
1. Gather context
2. Identify gaps
3. Draft agenda
4. Ask approval
5. Update CRM / Linear / notes
6. Summarize changes

What we learned the hard way
- The model is not the product.
- Boundaries matter more than prompts.
- Tool design is product design.
- Evals need traces.
- Trust is earned through control.

Memory is becoming explicit files, stores, namespaces, graphs.
Tools are becoming APIs for a new kind of user.
Trust is becoming guardrails, approvals, audit trails.
Autonomy is becoming durable workflows.
Evals are moving from answer checks to trajectory checks.
Agents are forcing us to build software engineering around model behavior.

Everyone is independently rediscovering that agents are systems. The industry
is converging on the same boring primitives. We are solving the same problems we solved in Linux decades ago.

The good news: If agents are systems, we already know how to make them reliable.
- Pick one real workflow to agentify
- Log every step: context, tool calls, approvals, and outputs. Replay failures as trajectories.
- Turn real traces into an eval harness
- Increase autonomy one permission at a time

Building reliable AI systems in production
From assistants to agents -> What a decade of building AI helpers taught me about productionizing agents
@AnandChowdhary
AnandChowdhary.com
Chowdhary.co
