Upgrade to Pro — share decks privately, control downloads, hide ads and more …

How Timee Delivers Day 1 Production Ready LLM ...

Avatar for tomoyks tomoyks
June 13, 2026

How Timee Delivers Day 1 Production Ready LLM Features

@DASH by Datadog 2026

Avatar for tomoyks

tomoyks

June 13, 2026

Other Decks in Technology

Transcript

  1. Feb. 5th, 2026 Two LLM features hit the same outage.

    FEATURE A ~5min to decide: no action needed FEATURE B ~3hrs down in production
  2. What happens when an LLM feature fails in production? What

    we learned. Why production readiness has to be default by Day 1.
  3. MLOps Engineer · Timee Tomoyuki Saito Platform engineering for ML

    and LLM workloads Observability and production readiness Bird Watching
  4. LLMs are becoming part of the product. Worker experience Better

    discovery, clearer information, and significantly less friction in finding the perfect next shift. Client experience Generation tools that help clients describe shifts faster, more accurately, and with higher conversion rates. Safety & Reliability Helping the platform be safer and more trustworthy across both sides through autonomous monitoring.
  5. Designed for failure from day one. Feature A Asynchronous workflow

    Retries and failure handling Fallback path
  6. Timee’s engineering organization API Call Deploy Deploy Platform Team (Platform

    Engineer) Stream Aligned Team (Product Engineer) Platform Team (MLOps Engineer) Complicated Subsystem Team (Data Scientist) ECS Cloud Run
  7. Timee’s engineering organization at LLM era API Call Deploy Deploy

    Platform Team (Platform Engineer) Stream Aligned Team (Product Engineer) Platform Team (MLOps Engineer) Complicated Subsystem Team (Data Scientist) ECS Cloud Run
  8. A checklist for LLM applications. GENERAL ENGINEERING Stability & Reliability

    CI/CD automation, Release process Scalability Auto-scaling capacity planning Fault Tolerance Blast radius Control, Runbooks Standard Monitoring System metrics, Structured logging Security Access control, Encryption LLM-SPECIFIC ADDITIONS Prompt Management Version control & regression testing Token & Cost Control TPM/RPM rate limiting, Cost budgeting Resilience Model fallback, Circuit breakers Guardrails & Safety PII masking, Hallucination filters LLM Observability TTFT tracking, Response tracing Continuous QA Golden datasets, Offline evaluation
  9. The Rails feature moved from PoC to production STEP 01

    PoC Job description generation — a natural LLM use case. → STEP 02 Strong ROI Clear business impact. Direct user-facing value. → STEP 03 Production Strong product signal. The feature shipped. Note: Operationally, readiness wasn't at the same level as the first Cloud Run feature.
  10. A checklist defines the expected state. It doesn't execute itself.

    PROBLEM 01 · DOC GAP PROBLEM 02 · HUMAN BOTTLENECK
  11. 20+ autonomous squads. 3-person MLOps team. PRODUCT SQUADS 01 02

    03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 MLOps +
  12. 20+ autonomous squads. 3-person MLOps team. PRODUCT SQUADS 01 02

    03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 MLOps + "We want to use LLMs"
  13. ARCHITECTURE DECISION LLM Gateway as common entry point. One internal

    HTTP endpoint. Every team. Every language. Every model.
  14. Risk vs Unlocking capabilities THREE STRATEGIC GOALS TO BALANCE THE

    ACKNOWLEDGED RISK Single Point of Failure THE STRATEGIC REASONING Observable and Controllable 01 02 03 Agility Reliability Governance VS
  15. The direction was right. The rollout was incomplete. JANUARY 2026

    First Gateway release FEBRUARY 5, 2026 The outage
  16. Visible LLM fallback DATADOG SCREENSHOT Cloud Run Request Error Count

    Total LLM Token Consumption Cloud Run Success Rate LLM Token Consumption Trend Feature A - Cloud Run/Python
  17. The Cost of Zero Visibility CRITICAL QUESTIONS WE COULDN'T ANSWER

    ? When will the LLM recover? ? How many requests are actually failing? ? Should we wait, switch models, or turn off the feature? RESULT: 3 HOURS OF DOWNTIME IN PRODUCTION Feature B - Rails/ECS
  18. Three good pieces. One missing connection. The pieces existed. But

    they weren't connected into a single shared path. 01 PIECE 1 Checklist Defined the expected state. 02 PIECE 2 Datadog Provided evidence — in some places. 03 PIECE 3 Gateway Started — not yet shared.
  19. THE LESSON Production readiness cannot stay as a checklist. It

    has to be built into the path teams naturally use.
  20. All LLM calls go through one common path. CALLERS ECS・Rails

    Job description gen Cloud Run · Python Job-flow LLM feature Batch · Python Offline / scheduled New use cases PoCs & future stacks LLM GATEWAY LiteLLM POST /v1/chat/completions MODELS PRIMARY Claude · Vertex AI FALLBACK Gemini EXPANSION Future hosted models LATER Open-weight + vLLM
  21. 01 Routing & fallback 02 Identity & tagging 03 Observability

    by Default 04 Governance & safety What every call gets, automatically
  22. Production readiness becomes something the platform helps deliver. CHECKLIST Defines

    the expected state What "ready" looks like. MONITORING Proves what is happening Evidence of readiness in the moment. GATEWAY Defaults into the path teams use Brings evidence to every call.
  23. The platform makes sure these don't depend on chance. STILL

    OWNED BY TEAMS Product UX and copy Business-specific fallback behavior Domain-aware prompt design Quality bar for their feature NO LONGER LEFT TO CHANCE ✕ No observability ✕ No cost visibility ✕ No clear fallback path ✕ No team-by-team variance
  24. Four things, every team, by default. 01 Faster incident decisions

    02 Cross-cloud tracing 03 Cost visibility from day one 04 Lower instrumentation burden
  25. From one external API to part of our infrastructure. 01

    More models More hosted models. More open-weight models we serve ourselves with vLLM. 02 Agent workflows Tool calls, multi-step reasoning, internal tool servers. The call graph gets harder to reason about. 03 MCP-connected systems External context, internal tools, distributed planning. End-to-end tracing becomes essential.
  26. For your own teams, regardless of stack. 01 A checklist

    alone is not enough. If it doesn't show up in the path teams use to ship, expect inconsistent outcomes the next time something breaks. 02 Pick one common path before you have five. Cost, fallback, and observability are far harder to retrofit than to standardize. If you have three teams today, that is the best time. 03 SDK gaps are an argument for the gateway pattern. Don't treat them as a blocker. Solve it once at the network boundary — every language benefits at the same time.
  27. One platform. Shared by every product teams. One default. Production-ready

    from day one. One outcome. No more two stories inside one company. That is Day 1 Production-Ready.