Stop Hacking Prompts

Reliability is the central challenge blocking enterprise AI adoption at scale. This session provides a framework that teams can apply regardless of model provider or agent runtime — advancing the community's shared understanding of how to systematically capture and codify agent behavior. The skill store pattern is directly applicable to any organization building multi-agent systems and contributes a concrete architectural model for the ecosystem to build on and extend.

Hugo Guerrero

June 04, 2026

Transcript

  1. QCon AI Boston 2026

    Stop Hacking Prompts: Architecting for Deterministic Outcomes in GenAI Systems
    Hugo Guerrero, Governance Board & TSC Member @AsyncAPI Initiative, Technical Product Lead @ Kong
  2. Session Roadmap

    01 The Mirage of the Perfect Prompt
    02 Why Prompt Hacking Cannot Deliver Determinism
    03 Human-in-the-Loop as a Governance Layer
    04 Capturing & Codifying Success Paths
    05 Building the Agent Skill Store
    06 Practical Example: Network Diagnosis
    07 Org Imperatives & Closing Blueprint
  3. The Enterprise AI Reality

    Models are impressive: GPT-4, Claude, Gemini: these models pass bar exams, write code, synthesize research. The capabilities are genuinely extraordinary.
    ≠
    Systems are unreliable: Yet enterprise teams can't ship them to production with confidence. The gap between demo and deployment has never been wider.
  4. Why Most AI Projects Stall

    Three compounding failure modes that block production deployment
    01 Inconsistent Outputs: The same prompt produces different results across runs, model versions, and context states. Teams can't establish a reliability baseline.
    02 Governance Concerns: Compliance, legal, and security teams can't audit what they can't inspect. Neural network weights are not an acceptable audit trail.
    03 Operational Risk: Hallucinations, planning drift, and unpredictable failure modes make AI agents too risky to own high-stakes automated workflows.
    These are not model problems. They are architecture problems.
  5. The Industry Response

    And why it keeps failing
    More detailed prompts: Add more context, more instructions, more examples
    More constraints: Enumerate every edge case, every prohibited action
    More prompt engineering: Hire specialists, build internal tooling, run prompt eval suites
    More complex instructions: Multi-step prompt chains, few-shot libraries, RAG pipelines
    The Pattern: Short-term improvement (false confidence) > Context drift (regression) > More prompt hacking ... repeat
    Treating a probabilistic system like a configuration file is the root cause.
  6. The Spark

    They wanted reliability. They were coaxing creativity.
    LLMs navigate probability space. Not configuration files.
    “We spent days fine-tuning the wording, the structure, and the contextual hints. At times, the output appeared correct, but then the model would produce odd suggestions or skip safety validations.” (System Integrator, India)
  7. Three Reasons Prompt Hacking Cannot Deliver Determinism

    Probabilistic Sampling: Every token is drawn from a distribution. Temperature controls creativity, not correctness. Even at temp=0 you're taking the peak of a probability curve, not executing a rule.
    Extreme Context Sensitivity: Minor shifts in conversation history, memory state, or system message alter the output. The accumulated context of three successful runs creates a different probability landscape than a fresh agent.
    No Procedural Memory: A perfect execution is not stored as a reusable process. Tomorrow the agent re-plans from scratch. Every run is improvisation. Improvisation is the enemy of the enterprise.
    The model cannot be deterministic. Determinism must come from the system surrounding the model.
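    The sampling point can be made concrete with a minimal sketch. The logits below are hypothetical values for three candidate tokens, and `pick_token` is an illustrative helper, not any provider's API. It shows that temp=0 reduces to picking the peak, and that a small numerical shift in the logits moves that peak.

```python
import math
import random


def softmax(logits, temperature):
    """Turn logits into probabilities; a lower temperature sharpens the peak."""
    scaled = [x / temperature for x in logits]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def pick_token(logits, temperature, rng):
    """Greedy argmax at temperature 0, otherwise sample from the distribution."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


# Hypothetical logits for three candidate next tokens.
logits = [2.0, 1.9, 0.5]
rng = random.Random(0)

greedy = pick_token(logits, 0, rng)                       # always the current peak
sampled = {pick_token(logits, 1.0, rng) for _ in range(200)}  # varies run to run

# A tiny shift (for example after a model or infrastructure change) moves the peak.
shifted = [2.0, 2.05, 0.5]
greedy_after_shift = pick_token(shifted, 0, rng)
```

    Greedy decoding is repeatable only while the logits stay identical, which is the condition the slide argues teams cannot guarantee.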
  8. Two Myths That Keep Teams Stuck

    MYTH 1: "Just set temperature=0"
    What teams believe: ✗ Low temp = deterministic output ✗ Fixed seed = repeatable results
    The reality: ✓ Model updates break seed guarantees ✓ Infra changes shift token peaks ✓ Temperature ≠ correctness control
    MYTH 2: "Bigger context = more reliable"
    What teams believe: ✗ More context = better grounding ✗ 200K tokens eliminates gaps
    The reality: ✓ More context = more ambiguity ✓ Model can’t weight relevance perfectly ✓ Larger windows compound drift
    You cannot parameterize your way to determinism. You need architecture.
  9. The Architecture Shift

    FROM: Prompt-Centric ✗ LLM is the executor for every task ✗ Reliability depends on prompt precision ✗ Success is lost at end of session ✗ Every run is improvisation ✗ Audit means reading model outputs ✗ Scale = more inference tokens
    TO: Architecture-Centric ✓ LLM is a router to proven tools ✓ Reliability comes from the system ✓ Success is captured and codified ✓ Recurring tasks execute deterministically ✓ Audit means reading executable code ✓ Scale = skill store grows, costs drop
  10. Architectural Determinism

    01 Reasoning Layer: LLM explores novel problems, plans approaches, generates candidate solutions
    02 Human Oversight: Expert validation confirms correctness, compliance, safety, and strategic alignment
    03 Codification Layer: Validated paths converted into executable artifacts: code, documentation, metadata
    04 Skill Store: Persistent artifact library; the agent routes known problems to deterministic execution
    Flywheel: each new solved problem expands the skill store → less LLM involvement → lower cost + higher reliability
  11. Architectural Determinism: System View

    External Input: User Request (task intent); Context / Memory (session history); Existing Artifacts? (skill store lookup). YES ▶ Route to Artifact (skip reasoning). NO ▼ continue to the Reasoning Layer.
    Reasoning Layer: LLM Inference (novel problem reasoning) ▶ Tool Selection (plan + action sequence) ▶ Candidate Output (reasoning trace)
    Human Oversight: Domain Review (correctness check) ▶ Compliance Check (policy + risk) ▶ Intent Alignment (strategic context) ▶ APPROVE? REJECT: back to reasoning
    Codification: Extract Logic (reasoning → code) ▶ Add Docs (rationale + audit) ▶ Define Rules (guards + metadata) ▶ Version & Tag (semver + deps)
    Skill Store: Artifact Library (versioned store) ▶ Metadata Index (intent tags + deps) ▶ Execution Engine (deterministic run) ▶ Audit Log (full trace) ▶ reuse loop
  12. Human Oversight: The Governance Core

    Reframe: Human oversight is not a bottleneck. It is the catalyst for reliability.
    Verify Correctness: Domain experts catch plausible-but-wrong reasoning that automated tests miss entirely, the hallucination no test suite can detect.
    Ensure Compliance: Regulatory and risk policy review before any workflow reaches the skill store. Codified artifacts make this tractable.
    Interpret Ambiguity: Align model actions with strategic intent, not a literal reading of the request. Humans provide context the model cannot infer.
    Detect Drift Early: Skilled reviewers develop intuition for when agent behavior begins to deviate from expected patterns, before it compounds systemically.
  13. Human-in-the-Loop Governance Workflow

    Flow: SYSTEM Agent Output + Trace ▶ HUMAN Domain Expert Review ▶ HUMAN Compliance Check ▶ HUMAN Intent Alignment ▶ DECISION Approve / Reject Decision ▶ SYSTEM Codify & Store Artifact. REJECT: feedback to agent, reasoning restarts.
    What reviewers check: ▶ Factual correctness of reasoning ▶ Missing or incorrect steps ▶ Plausible-but-wrong conclusions ▶ Safety validations present?
    Compliance gates: ▶ Regulatory scope (GDPR, HIPAA, SOX…) ▶ Risk classification: low / med / high ▶ Required approvers by tier ▶ Audit evidence captured?
    Alignment checks: ▶ Intent matches org strategy? ▶ Scope bounded correctly? ▶ Downstream impact assessed? ▶ Stakeholders notified?
    Timing Guide: Low-risk workflow < 1 hr; Medium-risk workflow < 4 hrs; High-risk workflow < 24 hrs
    Pay ONCE per workflow type, never again. Human review is a one-time cost per workflow type. Once approved, the artifact runs deterministically.
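    One way to picture the approve / reject decision is as a small gate over the three human checks. The sketch below is an assumption-laden illustration: `ReviewRecord`, `decide`, and the check names are hypothetical, and only the risk tiers and review-time targets come from the timing guide.

```python
from dataclasses import dataclass, field

# Review-time targets from the timing guide, in hours.
REVIEW_TARGET_HOURS = {"low": 1, "medium": 4, "high": 24}

# Hypothetical names for the three human checks in the workflow.
REQUIRED_CHECKS = ("domain_review", "compliance_check", "intent_alignment")


@dataclass
class ReviewRecord:
    workflow_type: str
    risk: str                                    # "low", "medium" or "high"
    passed: dict = field(default_factory=dict)   # check name -> bool


def decide(record: ReviewRecord) -> str:
    """Approve only when every required check has explicitly passed."""
    if record.risk not in REVIEW_TARGET_HOURS:
        raise ValueError(f"unknown risk tier: {record.risk!r}")
    if all(record.passed.get(check) is True for check in REQUIRED_CHECKS):
        return "approve"
    return "reject"  # feedback goes back to the agent and reasoning restarts


complete = ReviewRecord(
    "network-diagnosis", "medium",
    {"domain_review": True, "compliance_check": True, "intent_alignment": True},
)
partial = ReviewRecord("network-diagnosis", "medium", {"domain_review": True})
```

    Treating a missing check as a rejection keeps the gate fail-closed, which matches the slide's point that nothing reaches the skill store without review.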
  14. Capturing the Success Path

    Three activities that transform a one-time win into a permanent asset
    1 Record the Reasoning: Log key decisions, tools invoked, and action sequence. Capture reasoning chains even when full chain-of-thought is not exposed by the model.
    2 Identify Essential Steps: Human experts distinguish causally necessary steps from incidental ones, trimming the path to its minimal sufficient form.
    3 Map Dependencies: Record every data source, external API, and contextual variable. Build a complete dependency manifest for future reuse validation.
    Once validated by a human expert, this path becomes the blueprint for a deterministic artifact.
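    The three capture activities can be sketched as a small data structure. The class and field names here (`SuccessPath`, `Step`, `essential`) are hypothetical, chosen only to mirror the slide: steps are recorded as they happen, a human marks the essential ones, and dependencies accumulate into a manifest.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    tool: str
    decision: str
    essential: bool = False  # set later by the human expert, not by the agent


@dataclass
class SuccessPath:
    task: str
    steps: list = field(default_factory=list)
    dependencies: set = field(default_factory=set)

    def record(self, tool, decision, depends_on=()):
        """1. Record the reasoning: log the tool, the decision, and what it touched."""
        self.steps.append(Step(tool, decision))
        self.dependencies.update(depends_on)  # 3. Map dependencies

    def minimal_form(self):
        """2. Keep only the steps a reviewer marked as causally necessary."""
        return [step for step in self.steps if step.essential]


path = SuccessPath("diagnose packet loss")
path.record("read_logs", "check interface errors first", depends_on={"syslog"})
path.record("web_search", "look up vendor notes", depends_on={"internet"})
path.record("ping", "confirm gateway reachability", depends_on={"icmp"})

# Human review: the web search was incidental, the other two steps were necessary.
path.steps[0].essential = True
path.steps[2].essential = True
```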
  15. Turning Success Into Deterministic Artifacts

    PRIMARY: Executable Code. Scripts, API sequences, functions; Automation pipelines; Pre-validated logic paths; Errors are explicit, never silent
    CONTEXT: Documentation. Workflow purpose & rationale; Constraint descriptions; Audit instructions for reviewers; Decision boundary explanations
    GUARDRAILS: Metadata & Rules. Safety checks & validation logic; Required inputs / expected outputs; Exception handling conditions; Reuse criteria & preconditions
    Code does not hallucinate. It either executes correctly or returns a clear error, enabling true auditability.
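    A minimal sketch of such an artifact, bundling code, documentation, and rules in one record. All names (`Artifact`, `ArtifactError`, the `order_total` example) are hypothetical. The point illustrated is the "explicit, never silent" error behavior: a missing input raises instead of producing a guess.

```python
from dataclasses import dataclass
from typing import Callable


class ArtifactError(Exception):
    """Raised when an artifact cannot run as validated."""


@dataclass(frozen=True)
class Artifact:
    name: str
    version: str
    doc: str                      # CONTEXT: purpose and rationale
    required_inputs: tuple        # GUARDRAILS: declared inputs
    run: Callable[[dict], dict]   # PRIMARY: the executable logic

    def execute(self, inputs: dict) -> dict:
        missing = [key for key in self.required_inputs if key not in inputs]
        if missing:
            # Explicit failure instead of a plausible-looking answer.
            raise ArtifactError(f"{self.name}@{self.version}: missing inputs {missing}")
        return self.run(inputs)


def _order_total(inputs: dict) -> dict:
    return {"total_cents": inputs["quantity"] * inputs["unit_price_cents"]}


order_total = Artifact(
    name="order_total",
    version="1.0.0",
    doc="Computes an order total in cents from quantity and unit price.",
    required_inputs=("quantity", "unit_price_cents"),
    run=_order_total,
)

result = order_total.execute({"quantity": 3, "unit_price_cents": 250})
```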
  16. Why Code Wins

    Three properties that probabilistic outputs can never match
    Testable: Unit tests, integration tests, regression suites. Every behavior can be specified and verified. A test suite for a prompt is trying to sample a probability distribution; you can never cover it.
    Observable: Full execution trace. Every API call logged. Every decision point recorded. When something goes wrong in code, you know exactly where. When a model output is wrong, you're guessing.
    Auditable: A compliance auditor can read a Python script. They can trace logic, verify constraints, and confirm the decision boundary. They cannot audit a neural network's inference path.
    Moving logic from model weights into code is the single highest-leverage architectural decision in GenAI systems.
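    As an illustration of the "Testable" property, here is a small hypothetical artifact function with a unit test that pins its decision boundary. The function name and the 100 ms threshold are assumptions made for the example, not values from the talk.

```python
import unittest


def classify_latency(ms: float) -> str:
    """Hypothetical codified rule: under 100 ms is 'ok', otherwise 'degraded'."""
    if ms < 0:
        raise ValueError("latency cannot be negative")
    return "ok" if ms < 100 else "degraded"


class ClassifyLatencyTest(unittest.TestCase):
    def test_below_threshold(self):
        self.assertEqual(classify_latency(99.9), "ok")

    def test_boundary_is_degraded(self):
        # The decision boundary is specified exactly, so it can be audited.
        self.assertEqual(classify_latency(100), "degraded")

    def test_invalid_input_is_explicit(self):
        with self.assertRaises(ValueError):
            classify_latency(-1)
```

    Saved to a file, the suite can be run with `python -m unittest`. Every run exercises the same three behaviors, which is the contrast with sampling a prompt's output distribution.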
  17. The Agent Skill Store

    Persistent enterprise memory, not ephemeral conversation context
    1 Recognize goal: LLM matches intent to metadata patterns ▶ 2 Retrieve artifact: Pull code, docs, & rules from store ▶ 3 Validate conditions: Check preconditions match current state ▶ 4 Execute deterministically: Run artifact, no LLM tokens ▶ 5 Return result: Consistent, auditable, no improvisation
    LLM role shifts: Primary executor → Router to proven tools | Generates new reasoning only for genuinely novel problems
    The Reliability Flywheel: More solved problems → Larger artifact library. Larger library → Less LLM improvisation. Less improvisation → Lower cost + higher reliability.
    End State: Composed automation platform. LLM handles genuine novelty. Artifacts handle everything it has mastered.
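    The five steps can be sketched as a single routing function. This is a simplified illustration: the `SkillStore` class is hypothetical, and plain keyword-tag matching stands in for the LLM-based intent recognition the slide describes.

```python
class SkillStore:
    """Hypothetical in-memory skill store keyed by intent tags."""

    def __init__(self):
        self._skills = []  # (tags, precondition, run)

    def register(self, tags, precondition, run):
        self._skills.append((frozenset(tags), precondition, run))

    def handle(self, request: str, state: dict) -> dict:
        words = set(request.lower().split())
        for tags, precondition, run in self._skills:
            if tags <= words:                      # 1-2: recognize goal, retrieve artifact
                if not precondition(state):        # 3: validate conditions
                    return {"route": "llm", "reason": "precondition failed"}
                return {"route": "artifact",       # 4-5: execute and return
                        "result": run(state)}
        return {"route": "llm", "reason": "no matching artifact"}


store = SkillStore()
store.register(
    tags={"restart", "service"},
    precondition=lambda state: state.get("service") is not None,
    run=lambda state: f"restarted {state['service']}",
)

known = store.handle("please restart the billing service", {"service": "billing"})
novel = store.handle("summarize this incident report", {})
blocked = store.handle("restart service", {})
```

    Requests that match no artifact, or whose preconditions fail, fall back to the LLM path rather than running a skill outside the conditions it was validated for.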
  18. Skill Store: Reference Architecture

    Ingestion Pipeline: Validated Reasoning Trace ▼ Human Review Approval Record ▼ Dependency Manifest ▼ Code Generator
    Core Skill Store: ARTIFACT RECORD (→ executable_code.py → documentation.md → metadata.json + rules); Version Control (semver + git-hash); Metadata Index (intent tags | domain tags | dep fingerprints | preconditions); Immutable Audit Log (every read + write)
    Runtime Engine: Intent Classifier ▼ Precondition Validator ▼ Artifact Executor ▼ Result + Trace Return
    Kong AI Gateway | Auth · Rate Limiting · Policy Enforcement · Observability · Audit
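    Two pieces of the core store, versioned artifact identifiers and an append-only audit log, can be sketched with the standard library. The hash chain is one possible way to make tampering detectable and is an assumption of this sketch, as is using a content hash where the slide mentions a git hash. `AuditLog` and `artifact_id` are hypothetical names.

```python
import hashlib
import json


def artifact_id(name: str, semver: str, code: str) -> str:
    """Combine a semantic version with a short content hash (stand-in for a git hash)."""
    digest = hashlib.sha256(code.encode("utf-8")).hexdigest()[:12]
    return f"{name}@{semver}+{digest}"


def _entry_hash(body: dict) -> str:
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


class AuditLog:
    """Append-only log where each entry commits to the previous entry's hash."""

    def __init__(self):
        self._entries = []

    def append(self, action: str, artifact: str) -> None:
        prev = self._entries[-1]["hash"] if self._entries else ""
        body = {"action": action, "artifact": artifact, "prev": prev}
        self._entries.append({**body, "hash": _entry_hash(body)})

    def verify(self) -> bool:
        prev = ""
        for entry in self._entries:
            body = {k: entry[k] for k in ("action", "artifact", "prev")}
            if entry["prev"] != prev or entry["hash"] != _entry_hash(body):
                return False
            prev = entry["hash"]
        return True


skill = artifact_id("net_diag", "1.2.0", "print('diagnose')")
log = AuditLog()
log.append("write", skill)   # every write is logged
log.append("read", skill)    # and so is every read
```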
  19. The Division of Labor: Creativity vs Procedure

    LLM: Creative Reasoning (NOVEL PROBLEMS) ▶ Problem has never been seen before ▶ No existing artifact matches intent ▶ Requires synthesis of disparate context ▶ Solution path is genuinely ambiguous ▶ Exploration and hypothesis generation
    Output: candidate solution → human review → codify
    Artifact: Deterministic Execution (KNOWN PROBLEMS) ▶ Pattern matches a stored artifact ▶ Preconditions validated as met ▶ Execution path is already proven ▶ No new reasoning tokens needed ▶ Consistent, auditable, repeatable
    Output: deterministic result, full trace, zero improvisation
  20. AI Agent Lifecycle: From Request to Production

    01 Task Received (intent parsed) 02 Skill Store Lookup (known problem?) 03 YES: Route to Artifact (skip reasoning) 04 Deterministic Execution (artifact runs) 05 NO: LLM Reasoning (novel problem) 06 Human Validation (approve / reject) 07 Codify & Version (artifact created) 08 Store & Index (skill store grows)
    KNOWN PATH: 01 → 02 → 03 → 04 → RESULT
    NOVEL PATH (dashed): 01 → 02 → 05 → 06 → 07 → 08 → REFEEDS 02 (flywheel). A REJECT at 06 loops back to reasoning.
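    The two paths can be sketched as one function. The callables `reason` and `review` are hypothetical stand-ins for the LLM and the human reviewer, the store is a plain dictionary, and the retry limit is an assumption added so the reject loop terminates.

```python
def handle_task(intent, store, reason, review, max_attempts=3):
    """Known path: 01 → 02 → 03 → 04. Novel path: 01 → 02 → 05 → 06 → 07 → 08."""
    if intent in store:                           # 02: skill store lookup
        return store[intent](), "known"           # 03-04: route and execute
    feedback = None
    for _ in range(max_attempts):
        candidate = reason(intent, feedback)      # 05: LLM reasoning
        approved, feedback = review(candidate)    # 06: human validation
        if approved:
            store[intent] = candidate             # 07-08: codify and store
            return candidate(), "novel"
    raise RuntimeError(f"no approved solution for {intent!r}")


# Stub reasoning: the first attempt skips a safety check, the retry includes it.
def reason(intent, feedback):
    return (lambda: "rotated with backup") if feedback else (lambda: "rotated")


# Stub reviewer: rejects any candidate that does not take a backup first.
def review(candidate):
    if "backup" in candidate():
        return True, None
    return False, "take a backup before rotating"


store = {}
first = handle_task("rotate-logs", store, reason, review)
second = handle_task("rotate-logs", store, reason, review)
```

    The first call goes through reasoning and review; the second is served from the store without invoking either, which is the flywheel in miniature.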
  21. Consequences of the Architecture Shift

    Hallucination Reduction: ✓ Pre-validated artifacts reused without alteration ✓ Model delegates to trusted code for sensitive steps ✓ Code returns explicit errors, not plausible fiction ✓ Planning drift eliminated for known workflows
    Security Implications: ✓ Deterministic workflows reduce attack surface ✓ Prompt injection scope bounded to novel tasks only ✓ Decision logic is inspectable and threat-modelable ✓ Artifact versioning enables rollback on compromise
    Operational Economics: ✓ Known tasks skip LLM inference entirely ✓ Latency drops from seconds to milliseconds ✓ Token costs compound downward as store grows ✓ Onboarding: use library, skip prompt mastery
    This is not just a reliability play. It is simultaneously a security, governance, and economics play.
  22. Practical Example: Automated Network Diagnosis

    Prompt-Based (Before): ✗ Agent guesses which logs to check ✗ May miss a critical diagnostic step ✗ Re-plans from scratch every incident ✗ High token cost per execution ✗ Results vary across identical incidents
    VS
    Artifact-Driven (After): ✓ First incident: agent reasons, human validates ✓ Diagnostic script generated & saved to store ✓ Next incident: pattern matched → script runs ✓ Millisecond execution, minimal token cost ✓ Consistent results: intelligence crystallized
    "The results are consistent, the latency is lower, and the cost is reduced because the model is not generating thousands of reasoning tokens. The intelligence has been crystallized into a tool."
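    A diagnostic artifact of this kind might look like the sketch below. The talk does not list the actual checks, so the three check names and their order are hypothetical; probes are injected as callables so the same fixed sequence runs on every incident.

```python
# Hypothetical fixed check order; a real artifact would use the validated steps.
CHECK_ORDER = ("link_up", "dns_resolves", "gateway_reachable")


def diagnose(probes: dict) -> dict:
    """Run every check in a fixed order and stop at the first failure."""
    trace = []
    for name in CHECK_ORDER:
        if name not in probes:
            # A missing probe is an explicit error, never a skipped step.
            raise KeyError(f"missing probe: {name}")
        ok = bool(probes[name]())
        trace.append((name, ok))
        if not ok:
            return {"status": "fault", "failed_check": name, "trace": trace}
    return {"status": "healthy", "failed_check": None, "trace": trace}


healthy = diagnose({
    "link_up": lambda: True,
    "dns_resolves": lambda: True,
    "gateway_reachable": lambda: True,
})

dns_fault = diagnose({
    "link_up": lambda: True,
    "dns_resolves": lambda: False,
    "gateway_reachable": lambda: True,
})
```

    Identical probe results always yield an identical report and trace, in contrast to the "results vary across identical incidents" behavior of the prompt-based version.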
  23. The Kong Perspective

    Governing the full AI interaction path
    Kong’s AI Gateway governs every layer of the interaction path, from the LLM call to the artifact execution to the audit log.
    Manage Skills, Not Prompts: Platform teams maintain a versioned skill store, not a collection of fragile prompt templates. The artifact is the unit of governance.
    Govern the Execution Path: Every artifact invocation passes through the gateway: authentication, rate limiting, logging, policy enforcement, all before execution.
    Institutional Knowledge Becomes Code: The expertise encoded in the skill store is organizational IP. It is versionable, transferable, and independent of any individual engineer.
    From Experimentation to Production: This architecture is the bridge. Prompt-only systems stay in the pilot phase. Skill-store architectures can own production workloads.
  24. The Implementation Blueprint

    Novel Problem (LLM explores) ▶ Human Validation (Expert confirms) ▶ Codify Artifact (Code + Docs + Rules) ▶ Skill Store (Persistent library) ▶ Deterministic Execution (Route & run)
    Stop trying to outsmart the model: Prompts discover solutions. Code executes them. Separating these roles is the key insight.
    Hard-code the success: Every validated path is an organizational asset. Version it. Test it. Document it. Store it.
    Creativity for novelty, determinism for recurrence: LLMs handle what's new. Artifacts handle what's known. Clarity on this boundary is everything.
  25. Hard-Code the Success

    "The future of enterprise AI belongs not to those who write the cleverest prompts, but to the architects who build systems capable of capturing and hardening success."
    "Creativity solves new problems. Deterministic code solves recurring ones. Build a system that knows the difference."
    Stop hacking prompts. Capture success. Codify success. Reuse success.
    QCon AI Boston 2026 | Thank you | Questions?
  26. THANK YOU

    Ready for what’s next? Let’s talk at the booth.
    Connect with me: X (Twitter): @hguerreroo Bluesky: hguerreroo.bsky.social YouTube: hguerreroo LinkedIn: hugoguerrero
  27. BACKUP: Key Concepts Quick Reference

    Architectural Determinism: System design pattern that surrounds probabilistic LLMs with validation loops, human oversight, and deterministic artifacts to ensure reliable, auditable outcomes.
    Skill Store: Persistent, versioned repository of validated, codified execution artifacts. Unlike the context window, it persists across sessions and scales organizationally.
    Success Path Capture: Systematic recording of reasoning, essential steps, and dependencies from a validated execution, the precursor to codification.
    Planning Drift: Gradual deviation of an agent's replanning across repeated executions of the same task. A primary source of hallucination in agentic systems.
    Deterministic Artifact: Executable code, documentation, and metadata rules derived from a validated success path. Reusable without LLM involvement for known problems.
    Reliability Flywheel: The compounding effect whereby a growing skill store reduces LLM invocation frequency, simultaneously lowering cost and increasing reliability over time.