datadog dash 2025 LLM observability for reliability and stability

LLM Observability for Reliability & Stability DASH 2025 Hiroyuki Moriya
AI Engineer IVRy.Inc. 2025/06/11

2 LLMs can do anything. Easily build amazing products. Perfect
self-driving cars within a few years. Automate everything so humans don’t need to work anymore.

3 Easily build amazing products. Perfect self-driving cars within a
few years. Automate everything so humans don’t need to work anymore. LLMs can do anything. Not really.

4 What is necessary for providing LLM-powered service in production?
Today’s Topic

5 About me Develop products that integrate LLM APIs Monitor
& optimize LLM APIs for reliability and performance Hiroyuki Moriya AI engineer / SRE Speaker Intro

6 About IVRy Challenges Solutions Recap & Tips Agenda

7 Founded/HQ: 2019. Tokyo, Japan Number of employees: 200+ people
Product: AI/LLM based phone communication service Reach: 30,000+ accounts & 40 million+ incoming calls in total IVRy inc. Company Info “Revolutionizing the telephone experience and boosting productivity for businesses ”

8 AI-powered automated phone service Our Product

9 Simpliﬁed system architecture

12 Phone calls are still important communication tools in Japan
Source: Rakuten Communications, “Survey on Call Handling at Small and Medium-Sized Businesses.”

13 Trusted across industries

14 Medical appointments Restaurant reservations Hotel bookings FAQ inquiries We
power phone communication with AI for businesses of all sizes IVRy in action

16 Three key challenges for AI phone service Robust fault
detection & recovery Challenge #3 Minimizing hallucinations Challenge #1 Ensuring natural conversation pace Challenge #2

18 Three solutions for AI phone service Robust fault detection
& recovery Solution #3 Ensuring natural conversation pace Solution #2 Minimizing hallucinations Solution #1

19 LLMs can hallucinate Problem

20 Divide and conquer Solution #1-1

21 Example AI workﬂow Break down a task into multiple
specialized AI components. → Beer validation and error analysis, leading to more stable & reliable results.

22 An example of AI workﬂow in action

23 Outputs from LLM APIs can change due to silent
model updates Problem Output has changed

24 Trust, but verify Solution #1-2

25 Monitor LLM API consistency every day Solution 1. Test
cases 2. Run consistency tests 3. Notify / record results

26 Automated phone E2E test

30 Executing phone E2E tests after code merge Merge code
Deploy latest code Execute automated phone E2E tests Monitor on Datadog LLM Observability

31 Monitoring with Datadog LLM observability

32 Categorizing topics of test cases

33 Reservation Cancellation Question Categorizing topics of test cases

34 To minimize hallucinations, 1 Divide and conquer Divide one
task into multiple, easier steps. Trust, but verify Verify LLM API responses regularly. 2 Summary 34

35 Three solutions for AI phone service Minimizing hallucinations Solution
#1 Ensuring natural conversation pace Solution #2 Robust fault detection & recovery Solution #3

36 Slow dialogue could miss oportunities Problem

37 Done is beer than perfect Solution #2-1

38 Fast, stable, and cheap Slower, more $$$ We choose
fast, proven models over cuing-edge but slow ones—beer latency, fewer rate limits, lower cost. Stability & performance > latest models

39 See the forest for the tree Solution #2-2

40 Monitor metrics with Datadog Inferred Services

41 Metrics are shown on inferred services page

42 To ensure natural conversation pace, 1 Done is beer
than perfect Choose the model that aligns with your case. See the forest for the tree See the overall metrics for each client. 2 Summary 42

43 Three solutions for AI phone service Robust fault detection
& recovery Solution #3 Minimizing hallucinations Solution #1 Ensuring natural conversation pace Solution #2

44 System failure could cause fatal issues Problem

45 LLM APIs connection is not stable. Connectivity issues happen
frequently. LLM API Status in one day

46 Prepare for the worst Solution #3

47 Created monitoring alerts using custom metrics.

48 Built a robust fallback system using multiple LLMs. It
routes requests based on API statuses. LLM fallback strategy

49 LLM APIs are called from LiteLLM proxy sidecars ECS
Cluster LLM APIs

50 Using LiteLLM proxy for other applications

51 Emergency phone transfer

52 To implement the robust fault detection and recovery, Prepare
for the worst Think the worst scenario and implement the robust recovery system. Summary 52

53 Key lessons for operating LLM APIs Divide and conquer
/ Trust, but verify 01 for minimizing hallucinations Done is beer than perfect / See the forest for the tree 02 for ensuring natural conversation pace Prepare for the worst 03 for robust fault detection & recovery

Thank you! Hiroyuki Moriya AI Engineer IVRy.Inc.

datadog dash 2025 LLM observability for reliabi...

datadog dash 2025 LLM observability for reliability and stability

More Decks by 株式会社IVRy（社員登壇資料）

Other Decks in Programming

Featured

Transcript