LLM Observability for Reliability & Stability
DASH 2025
Hiroyuki Moriya, AI Engineer, IVRy, Inc.
2025/06/11
Slide 2
LLMs can do anything.
Easily build amazing products.
Perfect self-driving cars within a few years.
Automate everything so humans don’t need to work anymore.
Slide 3
Easily build amazing products.
Perfect self-driving cars within a few years.
Automate everything so humans don’t need to work anymore.
LLMs can do anything. Not really.
Slide 4
What is necessary for providing
LLM-powered service
in production?
Today’s Topic
Slide 5
Speaker Intro
About me: Hiroyuki Moriya, AI engineer / SRE
- Develop products that integrate LLM APIs
- Monitor & optimize LLM APIs for reliability and performance
Slide 6
Agenda
1. About IVRy
2. Challenges
3. Solutions
4. Recap & Tips
Slide 7
Company Info
IVRy inc.: "Revolutionizing the telephone experience and boosting productivity for businesses"
- Founded/HQ: 2019, Tokyo, Japan
- Employees: 200+
- Product: AI/LLM-based phone communication service
- Reach: 30,000+ accounts & 40 million+ incoming calls in total
Slide 8
AI-powered automated phone service
Our Product
Slide 9
Simplified system architecture
Slide 10

Slide 11

Slide 12
Phone calls are still important
communication tools in Japan
Source: Rakuten Communications, “Survey on Call Handling at Small and Medium-Sized Businesses.”
Slide 13
Trusted across industries
Slide 14
- Medical appointments
- Restaurant reservations
- Hotel bookings
- FAQ inquiries
We power phone communication with AI
for businesses of all sizes
IVRy in action
Slide 15
Agenda
1. About IVRy
2. Challenges
3. Solutions
4. Recap & Tips
Slide 16
Three key challenges for AI phone service
Challenge #1: Minimizing hallucinations
Challenge #2: Ensuring natural conversation pace
Challenge #3: Robust fault detection & recovery
Slide 17
Agenda
1. About IVRy
2. Challenges
3. Solutions
4. Recap & Tips
Slide 18
Three solutions for AI phone service
Solution #1: Minimizing hallucinations
Solution #2: Ensuring natural conversation pace
Solution #3: Robust fault detection & recovery
Slide 19
LLMs can hallucinate
Problem
Slide 20
Divide and conquer
Solution #1-1
Slide 21
Example AI workflow
Break down a task into multiple specialized AI components.
→ Better validation and error analysis, leading to more stable & reliable results.
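A minimal sketch of this divide-and-conquer pattern, assuming a generic chat-completion client. `call_llm`, the step functions, and the intent labels are hypothetical stand-ins, not IVRy's actual components; the point is that each small step gets its own prompt and its own validation.

```python
import json

# Divide and conquer: split one phone-call turn into small, specialized steps,
# each with its own prompt and its own validation, instead of one giant prompt.
# `call_llm` is a hypothetical stand-in for any chat-completion API client.

def classify_intent(call_llm, utterance: str) -> str:
    """Step 1: classify the caller's intent into a closed set of labels."""
    label = call_llm(
        "Classify the intent as one of: reservation, cancellation, question.\n"
        f"Utterance: {utterance}"
    ).strip().lower()
    # Validate this step's output before the next step runs; fail fast on drift.
    if label not in {"reservation", "cancellation", "question"}:
        raise ValueError(f"unexpected intent label: {label!r}")
    return label

def extract_slots(call_llm, utterance: str) -> dict:
    """Step 2: extract the structured fields needed to act on the intent."""
    raw = call_llm('Extract {"date": ..., "name": ...} as JSON from: ' + utterance)
    return json.loads(raw)  # malformed JSON is caught here, not downstream

def handle_turn(call_llm, utterance: str) -> dict:
    """Chain the specialized steps; each failure points at one component."""
    return {
        "intent": classify_intent(call_llm, utterance),
        "slots": extract_slots(call_llm, utterance),
    }
```

Because each component is a plain function, it can be unit-tested in isolation with a fake `call_llm`, which is what makes the error analysis tractable.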
Slide 22
An example of an AI workflow in action
Slide 23
Outputs from LLM APIs can change
due to silent model updates
Problem
Output has changed
Slide 24
Trust, but verify
Solution #1-2
Slide 25
Monitor LLM API consistency
every day
Solution
1. Test cases
2. Run consistency tests
3. Notify / record results

Slide 33

Categorizing topics of test cases: Reservation, Cancellation, Question
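The daily check above can be sketched as follows, assuming a generic model client. `query_model`, `notify`, the test prompts, and the threshold are illustrative placeholders for the real API client, alerting hook, and test suite.

```python
# Daily consistency check: replay a fixed set of test cases against the LLM API
# and alert when the pass rate drops (e.g. after a silent model update).
# `query_model` and `notify` are hypothetical stand-ins for the real API client
# and the team's notification hook.

TEST_CASES = [
    # Categorized by topic: reservation / cancellation / question.
    {"prompt": "I want to book a room for Friday.", "expected_intent": "reservation"},
    {"prompt": "Please cancel my appointment.",     "expected_intent": "cancellation"},
    {"prompt": "What are your opening hours?",      "expected_intent": "question"},
]

def run_consistency_tests(query_model, notify, threshold: float = 0.95) -> float:
    """Run every test case, record the pass rate, and notify on regression."""
    passed = sum(
        1 for case in TEST_CASES
        if query_model(case["prompt"]) == case["expected_intent"]
    )
    pass_rate = passed / len(TEST_CASES)
    if pass_rate < threshold:
        notify(f"LLM consistency dropped to {pass_rate:.0%}")
    return pass_rate
```

Running this on a daily schedule turns "trust, but verify" into a recorded time series: a sudden drop in pass rate is the signature of a silent model update.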
Slide 34
Summary
To minimize hallucinations,
1. Divide and conquer: divide one task into multiple, easier steps.
2. Trust, but verify: verify LLM API responses regularly.
Slide 35
Three solutions for AI phone service
Solution #1: Minimizing hallucinations
Solution #2: Ensuring natural conversation pace
Solution #3: Robust fault detection & recovery
Slide 36
Slow dialogue could mean missed opportunities
Problem
Slide 37
Done is better than perfect
Solution #2-1
Slide 38
Stability & performance > latest models
Fast, stable, and cheap vs. slower, more $$$
We choose fast, proven models over cutting-edge but slow ones: better latency, fewer rate limits, lower cost.
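One way to make this trade-off explicit is to select by latency budget first and cost second. A minimal sketch; the model names and figures below are illustrative placeholders, not real vendor numbers.

```python
# "Done is better than perfect": pick a model by latency budget and cost,
# not by leaderboard rank. All names and numbers here are made up for
# illustration only.

CANDIDATES = [
    {"name": "fast-proven",   "p95_latency_s": 0.8, "usd_per_1k_tokens": 0.002},
    {"name": "frontier-slow", "p95_latency_s": 3.5, "usd_per_1k_tokens": 0.030},
]

def pick_model(candidates, latency_budget_s: float) -> dict:
    """Cheapest candidate whose p95 latency fits a real-time phone turn."""
    fitting = [c for c in candidates if c["p95_latency_s"] <= latency_budget_s]
    if not fitting:
        raise RuntimeError("no model fits the latency budget")
    return min(fitting, key=lambda c: c["usd_per_1k_tokens"])
```

For a phone conversation the latency budget is set by what callers tolerate as a pause, so a frontier model that blows the budget is disqualified regardless of quality.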
Slide 39
See the forest for the trees
Solution #2-2
Slide 40
Monitor metrics
with Datadog
Inferred Services
Slide 41
Metrics are shown on the inferred services page
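The "forest" view means aggregating latency per client rather than reading individual calls. In production this rolls up into Datadog's inferred services; the pure-Python stand-in below sketches the same aggregation.

```python
# "See the forest for the trees": summarize latency per client instead of
# inspecting individual calls. A stand-in for the dashboard-side aggregation.

import math
from collections import defaultdict

def p95(samples: list) -> float:
    """Nearest-rank 95th percentile of a list of latency samples."""
    ordered = sorted(samples)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]

def per_client_p95(calls: list) -> dict:
    """calls: (client_id, latency_seconds) pairs -> p95 latency per client."""
    by_client = defaultdict(list)
    for client_id, latency in calls:
        by_client[client_id].append(latency)
    return {client: p95(latencies) for client, latencies in by_client.items()}
```

A per-client p95 surfaces the one tenant whose conversations are dragging even when the global average looks healthy.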
Slide 42
Summary
To ensure natural conversation pace,
1. Done is better than perfect: choose the model that aligns with your use case.
2. See the forest for the trees: see the overall metrics for each client.
Slide 43
Three solutions for AI phone service
Solution #1: Minimizing hallucinations
Solution #2: Ensuring natural conversation pace
Solution #3: Robust fault detection & recovery
Slide 44
System failure could cause fatal issues
Problem
Slide 45
LLM API connections are not stable.
Connectivity issues happen frequently.
LLM API Status in one day
Slide 46
Prepare for the worst
Solution #3
Slide 47
Created monitoring alerts using custom metrics.
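A sketch of the kind of condition such an alert encodes: error rate over a sliding window of recent calls. In production this lives as a Datadog monitor on custom metrics; the window logic below is an illustrative stand-in, not the actual monitor definition.

```python
# Fault detection sketch: fire when the LLM API error rate over a sliding
# window of recent calls crosses a threshold. A stand-in for a Datadog
# monitor on custom metrics.

from collections import deque

class ErrorRateAlert:
    def __init__(self, window: int = 50, threshold: float = 0.2):
        self.results = deque(maxlen=window)  # True = the call failed
        self.threshold = threshold

    def record(self, failed: bool) -> bool:
        """Record one call's outcome; return True when the alert should fire."""
        self.results.append(failed)
        error_rate = sum(self.results) / len(self.results)
        return error_rate >= self.threshold
```

A windowed rate ignores a single transient timeout but reacts within a few calls when a provider genuinely degrades, which is the trigger the fallback system needs.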
Slide 48
Built a robust fallback
system using multiple LLMs.
It routes requests based on
API statuses.
LLM fallback
strategy
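The routing idea reduces to: try providers in priority order and fall through on failure. In production this lives in the LiteLLM proxy; the provider callables below are hypothetical stand-ins for the real API clients.

```python
# Fallback sketch: try LLM providers in priority order; if one fails, fall
# through to the next. In production this routing is handled by a LiteLLM
# proxy sidecar; the callables here are hypothetical stand-ins.

def call_with_fallback(providers, prompt: str) -> str:
    """providers: ordered list of (name, callable) pairs."""
    errors = {}
    for name, call in providers:
        try:
            return call(prompt)       # first healthy provider wins
        except Exception as exc:      # connectivity issues happen frequently
            errors[name] = exc        # remember why, then fall through
    raise RuntimeError(f"all LLM providers failed: {errors}")
```

Keeping the errors per provider matters: when every provider fails, the final exception tells you whether it was one outage or a correlated incident.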
Slide 49
LLM APIs are called from LiteLLM proxy sidecars
[Diagram: ECS Cluster → LLM APIs]
Slide 50
Using LiteLLM
proxy for other
applications
Slide 51
Emergency
phone transfer
Slide 52
Summary
To implement robust fault detection and recovery,
Prepare for the worst: think through the worst-case scenario and implement a robust recovery system.
Slide 53
Key lessons for operating LLM APIs
01. Divide and conquer / Trust, but verify: for minimizing hallucinations
02. Done is better than perfect / See the forest for the trees: for ensuring natural conversation pace
03. Prepare for the worst: for robust fault detection & recovery