form. They want to speak with your agent the way they talk to their friends.
contact_form.html.twig
▢ first name: Marie
▢ last name: Dupont
▢ email: m.dupont@…
▢ phone: +33 6 … 12
▢ reason for contact: callback about order
▢ preferred callback slot: tomorrow, 3pm
▢ country: France
▢ account id: 4521
▢ captcha: (empty)
/* friction */
03 / 24
form. They want to speak with your agent the way they talk to their friends.
chat input
“Hi, Marie Dupont here. I’m calling about an issue with order 4521, any chance of a callback tomorrow around 3pm? You can reach me at 06 12 34 56 78.”
extracted →
first name: Marie
last name: Dupont
email: m.dupont@…
phone: +33 6 … 12
reason for contact: callback about order
preferred callback slot: tomorrow, 3pm
country: France
account id: 4521
captcha: (empty)
Same content, smaller intent, but the user still has to type every word and could forget information.
03 / 24
agent
The same domain logic, tools, memory. Nothing duplicated.
02 Two modalities
Text most of the time. Voice when available. Same brain, same API.
03 A transparent wrapper
STT + TTS glued to the agent you already have.
04 / 24
"Hi, Marie Dupont here. I’m calling about an issue with order 4521, any chance of a callback tomorrow around 3pm?"
→ agent
   LLM: gpt-4o · claude · …
   prompt: how it behaves
   tools: what it can do
   memory: what it remembers
→ "Got it, Marie. I’ve booked a callback on order 4521 for tomorrow at 3pm."

Same agent. Same output. Different behavior.

VOICE
"Hi, Marie Dupont here. I’m calling about an issue with order 4521, any chance of a callback tomorrow around 3pm?"
→ STT (audio → text)
→ agent
   LLM: gpt-4o · claude · …
   prompt: how it behaves
   tools: what it can do
   memory: what it remembers
→ TTS (text → audio)
→ "Got it, Marie. I’ve booked a callback on order 4521 for tomorrow at 3pm."
05 / 24
the seconds go.
user: "What's my balance?" → agent: "Your balance is $1,247."
states: user transcribing… → said; agent idle → thinking… → speaking

phase        step               latency    running total
01 · STT     mic → server       ~80 ms
             VAD close          ~120 ms
             STT                ~250 ms    ~450 ms (1/3 phases)
02 · AGENT   LLM first token    ~400 ms    ~850 ms (2/3 phases)
03 · TTS     TTS first chunk    ~180 ms
             network out        ~70 ms     ~1100 ms (3/3 phases)
06 / 24
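The running totals on this slide are plain addition, and it can help to keep the budget in code so a regression in one step is visible at once. A minimal sketch, using only the numbers shown on the slide; the function name `runningTotals` is made up for this example.

```php
<?php
declare(strict_types=1);

// Phase timings in milliseconds, copied from the slide.
$phases = [
    'STT' => ['mic → server' => 80, 'VAD close' => 120, 'STT' => 250],
    'AGENT' => ['LLM first token' => 400],
    'TTS' => ['TTS first chunk' => 180, 'network out' => 70],
];

/**
 * Returns the cumulative latency after each phase, keyed by phase name.
 *
 * @param array<string, array<string, int>> $phases
 * @return array<string, int>
 */
function runningTotals(array $phases): array
{
    $totals = [];
    $sum = 0;
    foreach ($phases as $name => $steps) {
        $sum += array_sum($steps);
        $totals[$name] = $sum;
    }

    return $totals;
}

$totals = runningTotals($phases);
// STT: 450 ms, AGENT: 850 ms, TTS: 1100 ms
```

The last entry is the time to the first audible sound, which is the figure the talk treats as the one that matters.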
but what about intent? < 500 ms Feels natural The baseline < 1000 ms Feels slow Slow tool calls, latency < 2000 ms User hangs up Network issues, failover, etc. > 3000 ms
The measure is the first audible syllable, not the full response. The first sound out of the speaker is the number the user feels.
07 / 24
01 Speech start
Open the STT socket.
02 Speech end
Close after 400 ms of silence: the "turn".
03 interrupt (back to 01)
If TTS is active when speech starts, cancel & restart.
04 turn closed
Hand the transcript to the agent. Wait for the next start.
08 / 24
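The start and end rules above can be sketched as a tiny state machine. This is an illustration only: `TurnDetector` is a hypothetical class, not part of Symfony AI, and it assumes some VAD already gives one speech/no-speech decision per audio frame along with a millisecond timestamp.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical turn detector: feed it one VAD decision per audio frame.
 */
final class TurnDetector
{
    private bool $inTurn = false;
    private int $lastSpeechAtMs = 0;

    public function __construct(private readonly int $silenceMs = 400)
    {
    }

    /** Returns 'start', 'end', or null when the frame changes nothing. */
    public function feed(bool $isSpeech, int $nowMs): ?string
    {
        if ($isSpeech) {
            $this->lastSpeechAtMs = $nowMs;
            if (!$this->inTurn) {
                $this->inTurn = true;

                return 'start'; // step 01: open the STT socket
            }

            return null;
        }

        // Silence: close the turn once it has lasted long enough.
        if ($this->inTurn && $nowMs - $this->lastSpeechAtMs >= $this->silenceMs) {
            $this->inTurn = false;

            return 'end'; // step 02: the turn is closed
        }

        return null;
    }
}

// Frames as [isSpeech, timestamp in ms].
$detector = new TurnDetector();
$events = [];
foreach ([[true, 0], [true, 100], [false, 200], [false, 400], [false, 500], [true, 900]] as [$speech, $at]) {
    $events[] = $detector->feed($speech, $at);
}
```

The 400 ms default comes from the slide. A shorter window cuts latency but splits sentences at natural pauses, so it is worth tuning against real recordings.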
USER · MIC: listening → speaking… → done
↓ THE USER STARTS TALKING
↓ OPEN THE NEW TURN
↓ NEW TURN: pending TTS bytes are dropped, not buffered
1 The user interrupts the agent: VAD fires the event
2 The current session is cancelled: the TTS stream is killed
3 A new turn opens: once the STT / TTS turn is down, the agent answers
09 / 24
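The three steps of the interruption can be sketched with a small session object. Everything here is hypothetical (`VoiceSession`, `queueTts`, `onSpeechStart` are names invented for the example); the only behavior taken from the slide is that pending TTS output is dropped rather than buffered, and that a new turn opens.

```php
<?php
declare(strict_types=1);

/** Hypothetical per-call session; none of these names come from Symfony AI. */
final class VoiceSession
{
    /** @var list<string> TTS chunks generated but not yet sent to the caller */
    private array $pendingTts = [];
    private bool $speaking = false;
    private int $turn = 0;

    public function queueTts(string $chunk): void
    {
        $this->pendingTts[] = $chunk;
        $this->speaking = true;
    }

    /**
     * Called when VAD reports that the user started talking.
     * Returns true when an ongoing answer was interrupted.
     */
    public function onSpeechStart(): bool
    {
        $interrupted = $this->speaking;

        $this->pendingTts = [];  // dropped, not buffered
        $this->speaking = false; // the TTS stream is killed
        ++$this->turn;           // a new turn opens

        return $interrupted;
    }

    public function pendingChunks(): int
    {
        return \count($this->pendingTts);
    }

    public function turn(): int
    {
        return $this->turn;
    }
}

$session = new VoiceSession();
$session->onSpeechStart();            // first turn, nothing to interrupt
$session->queueTts('Your balance');
$session->queueTts(' is $1,247.');
$interrupted = $session->onSpeechStart(); // the user talks over the answer
```

In a real pipeline the same hook would also have to cancel the in-flight LLM and TTS requests, which this sketch leaves out.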
stay the same
The same Symfony you already know. Autowired services, no separate runtime to think about.
02 Configuration as glue
Semantic configuration glues STT and TTS together. One change, cache:clear, and continue.
03 Your existing stack
Doctrine and Redis for memory. Messenger for async. Monolog for logs and traces. Your stack as it already is.
04 PHP, almost end to end
Need real-time feedback? LiveComponents to the rescue, with a sprinkle of JavaScript.
10 / 24
    // Namespaces assumed from the Symfony AI and Filesystem components.
    use Symfony\AI\Agent\AgentInterface;
    use Symfony\AI\Platform\Message\Content\Audio;
    use Symfony\AI\Platform\Message\Message;
    use Symfony\AI\Platform\Message\MessageBag;
    use Symfony\Component\Filesystem\Filesystem;

    final class FooService
    {
        public function __construct(
            private readonly AgentInterface $support,
            private readonly Filesystem $filesystem,
        ) {}

        public function handle(string $base64encodedInput): string
        {
            // Write the decoded audio to a temporary .wav file.
            $path = $this->filesystem->tempnam(sys_get_temp_dir(), 'audio-', '.wav');
            $this->filesystem->dumpFile($path, base64_decode($base64encodedInput));

            // Audio is content: wrap the file and send it like any user message.
            $result = $this->support->call(new MessageBag(
                Message::ofUser(Audio::fromFile($path)),
            ));

            return $result->asText();
        }
    }

Audio is just content.
1 The input arrives as a base64 string sent by the JavaScript (LiveComponents for example).
2 We write it to a file to retrieve the bytes. A stream would work just as well.
3 Wrap the file in an Audio object and send it through the MessageBag.
14 / 24
    // 1. … any AgentInterface. same call(), same contract.
    $voice = new SpeechAgent(
        agent: $supportAgent,
        configuration: new SpeechConfiguration(
            sttModel: 'whisper',
            ttsModel: 'eleven_multilingual_v2',
            ttsOptions: ['voice' => 'Dslrhjl3ZpzrctukrQSN'],
        ),
        speechToTextPlatform: $openai,
        textToSpeechPlatform: $elevenlabs,
    );

    // 2. audio in → audio out. tools, memory, prompt: untouched.
    $result = $voice->call(new MessageBag(
        Message::ofUser(Audio::fromFile('/tmp/answer.wav')),
    ));

    // 3. bytes in the body, the LLM's text rides in metadata.
    $audio = $result->getContent();              // mp3 bytes
    $text = $result->getMetadata()->get('text'); // "Your balance is …"

1 Decorator. It wraps any AgentInterface. Same call(), same contract.
2 Audio is content. Voice rides inside UserMessage as Audio. STT only runs when the latest user message has audio.
3 Both legs are optional. STT-only, TTS-only, or both, decided by which platforms + models you hand to SpeechConfiguration.
16 / 24
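The decorator idea is independent of any library, so it can be shown with stand-ins that run without a network call. This sketch does not reproduce the real SpeechAgent or AgentInterface: `SimpleAgent`, `EchoAgent`, `SpeechDecorator` and the fake STT/TTS closures are all hypothetical, and audio is faked as a string prefixed with `audio:`.

```php
<?php
declare(strict_types=1);

// Hypothetical stand-in for an agent contract; the real AgentInterface differs.
interface SimpleAgent
{
    public function call(string $input): string;
}

final class EchoAgent implements SimpleAgent
{
    public function call(string $input): string
    {
        return 'You said: ' . $input;
    }
}

/** Wraps any SimpleAgent and keeps the same call() contract. */
final class SpeechDecorator implements SimpleAgent
{
    /**
     * @param \Closure(string): string|null $stt audio → text, null to skip that leg
     * @param \Closure(string): string|null $tts text → audio, null to skip that leg
     */
    public function __construct(
        private readonly SimpleAgent $inner,
        private readonly ?\Closure $stt = null,
        private readonly ?\Closure $tts = null,
    ) {
    }

    public function call(string $input): string
    {
        $text = null !== $this->stt ? ($this->stt)($input) : $input;
        $answer = $this->inner->call($text); // the wrapped agent is untouched

        return null !== $this->tts ? ($this->tts)($answer) : $answer;
    }
}

// Fake providers so the sketch runs offline.
$stt = static fn (string $audio): string => substr($audio, \strlen('audio:'));
$tts = static fn (string $text): string => 'audio:' . $text;

$voice = new SpeechDecorator(new EchoAgent(), $stt, $tts);
$reply = $voice->call('audio:What is my balance?');
```

Because the decorator implements the same interface as the agent it wraps, callers cannot tell the two apart, and leaving one closure out gives the STT-only or TTS-only variants.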
about to be read aloud. ✕ WHAT YOU HAVE TODAY "Error 422: Unprocessable Entity. Validation failed on field dot customer underscore id dot required." ✓ WHAT CALLERS DESERVE "I couldn't find that account. Could you read me the order number again?" 19 / 24
card" → "change my card." Tool guardrails save you.
02 Model stalls
First token doesn't arrive in 2s. You need a timeout and a filler.
03 TTS 429s
Quota blown mid-sentence. Fall back to a second voice provider.
04 Audio socket drops
Mobile networks do this. Reconnect and regenerate the file.
05 Tool throws
The agent will happily hallucinate a refund confirmation. Don't let it.
06 User interrupts
TTS is mid-sentence, they're already talking. You need barge-in.
20 / 24
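Falling back to a second voice provider when the first one returns a 429 can be sketched in a few lines. This is a generic illustration with hypothetical names (`synthesizeWithFailover`, the `primary` and `secondary` closures), not the failover mechanism of any particular library, and the providers are faked so the example runs offline.

```php
<?php
declare(strict_types=1);

/**
 * Tries each provider in order and returns the first successful result.
 *
 * @param array<string, callable(string): string> $providers
 */
function synthesizeWithFailover(array $providers, string $text): string
{
    $errors = [];
    foreach ($providers as $name => $provider) {
        try {
            return $provider($text);
        } catch (\RuntimeException $e) {
            // Remember why this provider failed, then try the next one.
            $errors[] = $name . ': ' . $e->getMessage();
        }
    }

    throw new \RuntimeException('All TTS providers failed: ' . implode('; ', $errors));
}

$providers = [
    // Simulates a provider whose quota is exhausted.
    'primary' => static function (string $text): string {
        throw new \RuntimeException('429 Too Many Requests');
    },
    'secondary' => static fn (string $text): string => 'audio:' . $text,
];

$audio = synthesizeWithFailover($providers, 'Got it, Marie.');
```

The caller only sees the audio. Which provider produced it is an implementation detail, although logging the collected errors is worthwhile.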
01 What about observability?
Nothing changes. The Symfony Profiler, the DataCollector, Monolog: every call is traced.
02 What about a provider going down?
Declare a failover platform with two or more providers. The platform picks the live one, your agent never knows.
03 What about cache & cost?
Decorate with the Cache platform: same prompt twice is free. Stack Failover on top, you're future-ready.
04 What about a slow first token?
Network dependent, consider using the Cache platform or streaming the output.
05 What about a tool throwing?
Every tool returns a structured, speakable error. The agent reads "I couldn't reach that account", never a stack trace.
22 / 24
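One way to get a "structured, speakable error" from every tool is to run tools through a small wrapper. A minimal sketch under stated assumptions: `SpeakableError` and `runTool` are hypothetical names, and the result shape (`ok`, `speech`) is invented for the example.

```php
<?php
declare(strict_types=1);

/** Hypothetical exception whose message is safe to read aloud. */
final class SpeakableError extends \RuntimeException
{
}

/**
 * Runs a tool and always returns a structured result the agent can speak.
 *
 * @param callable(): string $tool
 * @return array{ok: bool, speech: string}
 */
function runTool(callable $tool): array
{
    try {
        return ['ok' => true, 'speech' => $tool()];
    } catch (SpeakableError $e) {
        // The tool author wrote this sentence for the caller's ears.
        return ['ok' => false, 'speech' => $e->getMessage()];
    } catch (\Throwable $e) {
        // Log $e elsewhere; the raw message and trace never reach TTS.
        return ['ok' => false, 'speech' => 'Sorry, something went wrong on my side. Could you try again?'];
    }
}

$result = runTool(static function (): string {
    throw new \InvalidArgumentException('Error 422: customer_id required');
});
```

The explicit `ok` flag also gives the prompt something to key on, so the agent has a clear signal that the action did not happen.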
Pick a model. Wire the tools and the prompt. The shape you already know. (01)
Add ears and a mouth: plug an STT and a TTS provider through the SpeechAgent. Audio in, audio out. (02)
Monitor and adapt: watch first-audible latency. Listen to real users. Tune voice, prompts, providers. (03)
23 / 24