Datadog DASH 2025: LLM Observability for Reliability and Stability

by IVRy Inc. (employee talk slides)

Slide 1

Slide 1 text

LLM Observability for Reliability & Stability. DASH 2025. Hiroyuki Moriya, AI Engineer, IVRy Inc. 2025/06/11

Slide 2

Slide 2 text

2 LLMs can do anything. Easily build amazing products. Perfect self-driving cars within a few years. Automate everything so humans don’t need to work anymore.

Slide 3

Slide 3 text

3 Easily build amazing products. Perfect self-driving cars within a few years. Automate everything so humans don’t need to work anymore. LLMs can do anything. Not really.

Slide 4

Slide 4 text

4 Today’s Topic: What is necessary to provide an LLM-powered service in production?

Slide 5

Slide 5 text

5 Speaker Intro: About me. Hiroyuki Moriya, AI engineer / SRE. Develops products that integrate LLM APIs. Monitors and optimizes LLM APIs for reliability and performance.

Slide 6

Slide 6 text

6 Agenda: About IVRy, Challenges, Solutions, Recap & Tips

Slide 7

Slide 7 text

7 Company Info: IVRy Inc. “Revolutionizing the telephone experience and boosting productivity for businesses.” Founded/HQ: 2019, Tokyo, Japan. Number of employees: 200+. Product: AI/LLM-based phone communication service. Reach: 30,000+ accounts and 40 million+ incoming calls in total.

Slide 8

Slide 8 text

8 Our Product: AI-powered automated phone service

Slide 9

Slide 9 text

9 Simplified system architecture

Slide 10

Slide 10 text

10 Simplified system architecture

Slide 11

Slide 11 text

11 Simplified system architecture

Slide 12

Slide 12 text

12 Phone calls are still important communication tools in Japan Source: Rakuten Communications, “Survey on Call Handling at Small and Medium-Sized Businesses.”

Slide 13

Slide 13 text

13 Trusted across industries

Slide 14

Slide 14 text

14 IVRy in action: We power phone communication with AI for businesses of all sizes. Use cases: medical appointments, restaurant reservations, hotel bookings, FAQ inquiries.

Slide 15

Slide 15 text

15 Agenda: About IVRy, Challenges, Solutions, Recap & Tips

Slide 16

Slide 16 text

16 Three key challenges for AI phone service. Challenge #1: Minimizing hallucinations. Challenge #2: Ensuring natural conversation pace. Challenge #3: Robust fault detection & recovery.

Slide 17

Slide 17 text

17 Agenda: About IVRy, Challenges, Solutions, Recap & Tips

Slide 18

Slide 18 text

18 Three solutions for AI phone service. Solution #1: Minimizing hallucinations. Solution #2: Ensuring natural conversation pace. Solution #3: Robust fault detection & recovery.

Slide 19

Slide 19 text

19 Problem: LLMs can hallucinate

Slide 20

Slide 20 text

20 Solution #1-1: Divide and conquer

Slide 21

Slide 21 text

21 Example AI workflow: Break down a task into multiple specialized AI components. → Better validation and error analysis, leading to more stable & reliable results.
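
A minimal sketch of the idea on this slide, not the workflow IVRy actually runs. The step names (`classify_intent`, `extract_date`), the allowed intents, and the rule-based stubs that stand in for LLM calls are all hypothetical. The point is only that each small component gets its own validator, so a failure can be traced to one step.

```python
import re
from dataclasses import dataclass
from typing import List

# Hypothetical label set for the first step.
ALLOWED_INTENTS = {"reservation", "cancellation", "question"}
DATE_PATTERN = r"\d{4}-\d{2}-\d{2}"


@dataclass
class StepResult:
    step: str
    output: str
    ok: bool


def classify_intent(utterance: str) -> str:
    # Stand-in for a small, single-purpose LLM call (assumption).
    text = utterance.lower()
    if "cancel" in text:
        return "cancellation"
    if "book" in text or "reserve" in text:
        return "reservation"
    return "question"


def extract_date(utterance: str) -> str:
    # Stand-in for a second specialized LLM call (assumption).
    match = re.search(DATE_PATTERN, utterance)
    return match.group(0) if match else ""


def run_workflow(utterance: str) -> List[StepResult]:
    # Each step is paired with a validator for its own output.
    steps = [
        ("classify_intent", classify_intent,
         lambda out: out in ALLOWED_INTENTS),
        ("extract_date", extract_date,
         lambda out: re.fullmatch(DATE_PATTERN, out) is not None),
    ]
    results: List[StepResult] = []
    for name, fn, validate in steps:
        output = fn(utterance)
        result = StepResult(name, output, validate(output))
        results.append(result)
        if not result.ok:
            break  # stop early; the last entry names the failing step
    return results
```

Calling `run_workflow("Please cancel my booking")` stops at `extract_date` with `ok=False`, which is the kind of per-step error analysis the slide refers to.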

Slide 22

Slide 22 text

22 An example of AI workflow in action

Slide 23

Slide 23 text

23 Problem: Outputs from LLM APIs can change due to silent model updates. (Figure label: Output has changed)

Slide 24

Slide 24 text

24 Solution #1-2: Trust, but verify

Slide 25

Slide 25 text

25 Solution: Monitor LLM API consistency every day. 1. Test cases 2. Run consistency tests 3. Notify / record results
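
The three steps on this slide could be sketched roughly as follows. This is an illustration under assumptions, not the speaker's test harness: `ConsistencyCase`, `fake_llm`, and the exact-match comparison are hypothetical, and a real check would call the LLM API and would probably need a more tolerant comparison.

```python
import json
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class ConsistencyCase:
    # Step 1: a fixed prompt with the answer we expect to stay stable.
    name: str
    prompt: str
    expected: str


def run_consistency_tests(
    cases: List[ConsistencyCase],
    call_llm: Callable[[str], str],
) -> List[Dict]:
    # Step 2: call the model for every case and compare with the expectation.
    records = []
    for case in cases:
        actual = call_llm(case.prompt).strip().lower()
        records.append({
            "name": case.name,
            "expected": case.expected,
            "actual": actual,
            "passed": actual == case.expected,
        })
    return records


def summarize(records: List[Dict]) -> str:
    # Step 3: a compact JSON summary to record or send as a notification.
    failed = [r["name"] for r in records if not r["passed"]]
    return json.dumps({"total": len(records), "failed": failed})


def fake_llm(prompt: str) -> str:
    # Hypothetical stub standing in for the real LLM API call.
    return "Reservation\n" if "book" in prompt else "question"
```

Running this once a day from a scheduler and sending the `summarize` output to a chat channel or a metrics backend would cover the "notify / record results" step.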

Slide 26

Slide 26 text

26 Automated phone E2E test

Slide 27

Slide 27 text

Slide 28

Slide 28 text

Slide 29

Slide 29 text

Slide 30

Slide 30 text

30 Executing phone E2E tests after code merge: 1. Merge code 2. Deploy latest code 3. Execute automated phone E2E tests 4. Monitor on Datadog LLM Observability

Slide 31

Slide 31 text

31 Monitoring with Datadog LLM observability

Slide 32

Slide 32 text

32 Categorizing topics of test cases

Slide 33

Slide 33 text

33 Categorizing topics of test cases: Reservation, Cancellation, Question

Slide 34

Slide 34 text

34 Summary: To minimize hallucinations, 1. Divide and conquer: divide one task into multiple, easier steps. 2. Trust, but verify: verify LLM API responses regularly.

Slide 35

Slide 35 text

35 Three solutions for AI phone service. Solution #1: Minimizing hallucinations. Solution #2: Ensuring natural conversation pace. Solution #3: Robust fault detection & recovery.

Slide 36

Slide 36 text

36 Problem: Slow dialogue could miss opportunities

Slide 37

Slide 37 text

37 Solution #2-1: Done is better than perfect

Slide 38

Slide 38 text

38 Stability & performance > latest models. We choose fast, proven models over cutting-edge but slow ones: better latency, fewer rate limits, lower cost. (Figure labels: "Fast, stable, and cheap" vs. "Slower, more $$$")

Slide 39

Slide 39 text

39 Solution #2-2: See the forest for the trees

Slide 40

Slide 40 text

40 Monitor metrics with Datadog Inferred Services

Slide 41

Slide 41 text

41 Metrics are shown on inferred services page
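
The slides show these metrics on Datadog's inferred services page. To illustrate the kind of per-client rollup that "overall metrics" implies, here is a hand-computed sketch. The sample format `(client, latency_ms, ok)` and the nearest-rank percentile are assumptions made for this example, not how Datadog computes its figures.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def percentile(values: List[float], pct: float) -> float:
    # Nearest-rank percentile over a non-empty list.
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize_by_client(
    samples: Iterable[Tuple[str, float, bool]],
) -> Dict[str, Dict[str, float]]:
    # Group individual requests by LLM client, then report aggregate
    # latency and error rate instead of looking at single calls.
    latencies: Dict[str, List[float]] = defaultdict(list)
    errors: Dict[str, int] = defaultdict(int)
    for client, latency_ms, ok in samples:
        latencies[client].append(latency_ms)
        if not ok:
            errors[client] += 1
    return {
        client: {
            "requests": len(vals),
            "p50_ms": percentile(vals, 50),
            "p95_ms": percentile(vals, 95),
            "error_rate": errors[client] / len(vals),
        }
        for client, vals in latencies.items()
    }
```

Looking at p95 latency and error rate per client, rather than at individual traces, is one way to read "see the forest for the trees".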

Slide 42

Slide 42 text

42 Summary: To ensure natural conversation pace, 1. Done is better than perfect: choose the model that fits your use case. 2. See the forest for the trees: look at the overall metrics for each client.

Slide 43

Slide 43 text

43 Three solutions for AI phone service. Solution #1: Minimizing hallucinations. Solution #2: Ensuring natural conversation pace. Solution #3: Robust fault detection & recovery.

Slide 44

Slide 44 text

44 Problem: System failure could cause fatal issues

Slide 45

Slide 45 text

45 LLM API connections are not stable. Connectivity issues happen frequently. (Figure: LLM API status in one day)

Slide 46

Slide 46 text

46 Solution #3: Prepare for the worst

Slide 47

Slide 47 text

47 Created monitoring alerts using custom metrics.
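
One way to emit a custom metric that an alert could watch is to send a counter to a local Datadog Agent over DogStatsD. The sketch below builds the plain-text datagram by hand. The metric name `llm.api.errors` and the tag keys are hypothetical, and the host and port assume an Agent listening on the default DogStatsD address. The slides do not say which metrics or client library were actually used.

```python
import socket
from typing import Dict


def format_count(metric: str, value: int, tags: Dict[str, str]) -> str:
    # DogStatsD counter datagram: <name>:<value>|c|#<tag>:<value>,...
    datagram = f"{metric}:{value}|c"
    if tags:
        tag_part = ",".join(f"{k}:{v}" for k, v in sorted(tags.items()))
        datagram += f"|#{tag_part}"
    return datagram


def send_count(
    metric: str,
    value: int,
    tags: Dict[str, str],
    host: str = "127.0.0.1",  # assumed local Agent
    port: int = 8125,         # default DogStatsD port
) -> None:
    # Fire-and-forget UDP send; no reply is expected from the Agent.
    payload = format_count(metric, value, tags).encode("utf-8")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))
```

A monitor could then alert when the sum of this counter for one provider crosses a threshold within a short window.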

Slide 48

Slide 48 text

48 LLM fallback strategy: Built a robust fallback system using multiple LLMs. It routes requests based on API statuses.
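
A hand-rolled sketch of status-based fallback routing, to make the idea concrete. It is not the speaker's implementation: the provider names, the `healthy` status map, and the choice of which exceptions count as a connectivity failure are all assumptions.

```python
from typing import Callable, Dict, List, Tuple


class AllProvidersFailed(Exception):
    """Raised when no provider in the priority list could answer."""


def call_with_fallback(
    prompt: str,
    providers: List[str],
    clients: Dict[str, Callable[[str], str]],
    healthy: Dict[str, bool],
) -> Tuple[str, str]:
    # Try providers in priority order, skipping any marked as down.
    for name in providers:
        if not healthy.get(name, True):
            continue
        try:
            return name, clients[name](prompt)
        except (ConnectionError, TimeoutError):
            # Record the failure so later requests skip this provider
            # until something else (e.g. a health check) resets the flag.
            healthy[name] = False
    raise AllProvidersFailed(f"no healthy provider among {providers}")
```

In this sketch the status map is updated only on failure. Resetting a provider to healthy, for example from a periodic health check, is left out.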

Slide 49

Slide 49 text

49 LLM APIs are called from LiteLLM proxy sidecars. (Diagram labels: ECS Cluster, LLM APIs)

Slide 50

Slide 50 text

50 Using LiteLLM proxy for other applications

Slide 51

Slide 51 text

51 Emergency phone transfer
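
The slide gives only the title, so the following is a guess at what a last-resort transfer could look like, with every detail assumed: the `CallSession` class, the failure threshold, and the `"RETRY_PROMPT"` / `"TRANSFER"` signals are hypothetical. The idea is that after repeated LLM failures within one call, the caller is handed off to a human line instead of being kept waiting.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CallSession:
    # Hypothetical threshold: transfer after this many failures in a row.
    max_consecutive_failures: int = 2
    failures: int = 0
    transferred: bool = False

    def handle_turn(
        self, utterance: str, generate_reply: Callable[[str], str]
    ) -> str:
        if self.transferred:
            return "TRANSFER"
        try:
            reply = generate_reply(utterance)
        except (ConnectionError, TimeoutError):
            self.failures += 1
            if self.failures >= self.max_consecutive_failures:
                # Give up on the AI path and hand the call to a human.
                self.transferred = True
                return "TRANSFER"
            return "RETRY_PROMPT"  # e.g. ask the caller to repeat
        self.failures = 0  # a successful turn resets the counter
        return reply
```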

Slide 52

Slide 52 text

52 Summary: To implement robust fault detection and recovery, prepare for the worst: think through the worst-case scenario and implement a robust recovery system.

Slide 53

Slide 53 text

53 Key lessons for operating LLM APIs. 01 Divide and conquer / Trust, but verify: for minimizing hallucinations. 02 Done is better than perfect / See the forest for the trees: for ensuring natural conversation pace. 03 Prepare for the worst: for robust fault detection & recovery.

Slide 54

Slide 54 text

Thank you! Hiroyuki Moriya AI Engineer IVRy.Inc.