→ model → output + Versions Logs JSON schemas Output evals Agent era The new unit of deployment goal → plan → tools → observations → state → action ← loop + Memory Permissions Approvals Traces Side effects
working memory of the world. Context includes external systems, history, workspace, current task, and user preferences. Context has scope, freshness, sensitivity, and source.
like hidden model state and more like a filesystem: durable facts, daily notes, searchable recall, and human-reviewable summaries. MEMORY.md — durable long-term facts/preferences/decisions memory/YYYY-MM-DD.md — daily running notes DREAMS.md — reviewable summaries / background sweeps Hybrid memory search: vector + keyword Memory flush before compaction so important context doesn’t vanish
frames memory as a user-level knowledge graph built from chat and session history. - add memory every turn - retrieve context from any session belonging to the user - session is used to determine relevance, not to limit scope - assistant messages can contextualize memory without necessarily being ingested as user facts - Graph API exists when the high-level memory API is too opinionated
memory vs. archival memory. Some memory belongs in the prompt, some belongs behind a search tool, and some should not be remembered at all. - long-term - semantically searchable - not always pinned into context - retrieved through tools when needed - good for documents, logs, customer history, support tickets, research notes
tools Tool design should be evaluation-driven Test tools on realistic tasks Inspect transcripts and failures Improve tool names, schemas, descriptions, and workflows
has 800 tools, the first failure is menu design. The future is not one agent with every tool. It is agents with the right tool belt for the current job.
problem. Direct MCP tool-calling doesn’t scale when you have hundreds/thousands of tools. The next generation of tool use may look less like function calling and more like giving agents a small programming environment.
token usage Code execution lets the agent inspect only what it needs The agent can filter/process data locally Anthropic’s own example reduced token used from 150k to 2k tokens anandchowdhary.com/blog/2026/agentscript
starts Validate/redact output before user sees it Check tool arguments/results Pause before side effects like cancellations, data edits, shell commands, or sensitive MCP actions
and more about the workflow runtime. 1. LLM proposes an action 2. Risky action pauses for human approval 3. Workflow waits without consuming compute 4. Approval comes through a signal 5. Timeout survives disruptions 6. Audit trail is preserved survives restarts!
open agentic web, coding agents Google: ADK, Agent2Agent, Agentspace Respan: evals, observability, production agents Every platform now has agents. Reliability is the differentiator.
Multi-agent coding workflows Orchestration around Claude Code/Agents SDK Context engineering as a first-class product concern “Multi-Claude” / parallel coding sessions Human control around hard problems in complex codebases
plan, tool calls, observations, approvals, final action Ask: right context? Ask: right tool? Ask: right sequence? Ask: right approval? Ask: right side effect?
why Has logs, but not traces Retries, but not intentionally Fails after touching real systems “But it worked yesterday” If you can’t replay it, you can’t improve it.
how to make them reliable. Pick one real workflow to agentify Log every step: context, tool calls, approvals, outputs, and replay failures as trajectories. Turn real traces into an eval harness Increase autonomy one permission at a time