Chatless AI - Building Realtime Voice Interfaces for Web and Mobile

Chatless AI: Architecting Realtime, Hands-Free AI Interfaces for Web & Mobile
Sascha Lehmann (@derLehmann_S), Consultant

Sascha Lehmann, Consultant @ Thinktecture AG
@derLehmann_S
https://www.linkedin.com/in/sascha-lehmann
https://www.thinktecture.com/thinktects/sascha-lehmann/

What "real-time" really means: speech interaction until now
Pipeline: Mic → STT (Speech to Text) → LLM → TTS (Text to Speech) → Speaker
• Each arrow represents 200-500 ms of latency
• Like a Zoom call with a huge delay
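As a rough back-of-the-envelope check, the per-arrow latency can be summed over the whole chain. This is a minimal sketch that assumes four arrows between the five stages (Mic, STT, LLM, TTS, Speaker) and simply adds the slide's 200-500 ms range per hop; the function and constant names are illustrative.

```python
# Latency budget for the classic cascaded voice pipeline.
# Assumption: four arrows between the five stages, each costing 200-500 ms.
STAGES = ["Mic", "STT", "LLM", "TTS", "Speaker"]
PER_HOP_MS = (200, 500)


def pipeline_latency_ms(stages=STAGES, per_hop_ms=PER_HOP_MS):
    """Return the (best case, worst case) end-to-end latency in milliseconds."""
    hops = len(stages) - 1  # one arrow between each pair of neighbouring stages
    low, high = per_hop_ms
    return hops * low, hops * high


low, high = pipeline_latency_ms()
print(f"{low}-{high} ms end to end")
```

Under these assumptions a single round trip takes roughly 0.8 to 2 seconds.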

What "real-time" really means: with a realtime model
Diagram labels: Mic, STT (Speech to Text), Realtime Model, TTS (Text to Speech), Speaker
• Permanently available audio channel
• It no longer feels like software, but like a real counterpart
• Average response time: 320 ms

Available Realtime Models
Property | OpenAI gpt-realtime | OpenAI gpt-realtime-mini | Gemini Live 2.5 Flash
Architecture | Speech-to-Speech native | Speech-to-Speech native | Multimodal Live (Audio/Video/Text)
Latency | ~320 ms (first response) | ~320 ms | Sub-800 ms
Languages | Multilingual, language change mid-sentence | Multilingual | 24 languages
Particular strengths | Best adherence to instructions, Tool Calling | Cost-efficient, quick | Video input, multimodal, cost-efficient
Barge-In support | [values not preserved in transcript]
Emotional adaptation | Configurable via prompt | Restricted | Affective Dialog native
Voices | 10+ voices | Like gpt-realtime | 30+ HD voices
Latest models | gpt-realtime-1.5 (Feb 2026) | gpt-realtime-mini-2025-12-15 | gemini-live-2.5-flash-native-audio (GA)

Architecture: High Level
• Model processes audio directly (lowest latency)
• Thinks and responds in speech
• Doesn't rely on a transcript
• Hears emotion and intent
• Filters noise
https://developers.openai.com/api/docs/guides/voice-agents/

Architecture: Transport (WebRTC vs. WebSocket)
Aspect | WebRTC | WebSocket
Protocol | UDP (packet-loss tolerant) | TCP (guaranteed delivery)
Latency | Very low (~50–150 ms) | Higher (TCP head-of-line blocking)
Features | Echo cancellation, noise suppression, jitter buffer | -
Ideal for | Browser apps, client-side | Server-side, phone integration
Downsides | NAT traversal complexity | Latency spikes during packet loss
https://developers.openai.com/api/docs/guides/voice-agents/

Animate the sample application <Demo/>

Realtime Connection
Diagram: App, Backend, and RealtimeAPI, with the labels System Prompt, RealtimeAPI-Key, Ephemeral Key, and WebRTC on steps 1 to 4.
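The labels on this diagram suggest an ephemeral-key handshake: the backend keeps the long-lived RealtimeAPI key, attaches the system prompt, and hands the app only a short-lived key for the WebRTC connection. The sketch below shows what the backend's request could look like. The endpoint URL, the JSON body shape, and the function name are assumptions modelled on OpenAI's Realtime API and should be checked against the current provider documentation; the request is only built here, not sent.

```python
import json
import urllib.request

# Assumption: endpoint for minting short-lived client secrets; verify before use.
CLIENT_SECRETS_URL = "https://api.openai.com/v1/realtime/client_secrets"


def build_ephemeral_key_request(api_key, system_prompt, model="gpt-realtime"):
    """Build (but do not send) the backend request for an ephemeral key.

    The long-lived api_key stays on the server; only the response to this
    request would be forwarded to the app.
    """
    body = {
        "session": {  # assumed body shape
            "type": "realtime",
            "model": model,
            "instructions": system_prompt,
        }
    }
    return urllib.request.Request(
        CLIENT_SECRETS_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_ephemeral_key_request("sk-server-side-only", "Respond ONLY in German.")
print(request.get_method(), request.full_url)
```

Keeping the system prompt on the backend means the app never sees either the API key or the full instructions.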

Client-side connection

Sending events

UX Challenges

1. What is my current state?
• The user has no visual cue
• Voice needs a state indicator
• Other methods
  • Audio cues
  • Sound effects

2. Success / Failure feedback
• Unique audio cue sound
• Vocal feedback from the agent
• Visual indicator update

3. Error Handling
• Automatic retry
• Agent asks for clarification
• Control to restart the voice agent session
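One way to combine these three measures is an escalation ladder keyed on consecutive failures. The thresholds and action names below are assumptions for illustration, not something the slides prescribe.

```python
def error_action(consecutive_failures):
    """Map the number of consecutive failures to an escalation step.

    Assumed ladder: retry silently first, then let the agent ask the user,
    and finally surface a control that restarts the voice agent session.
    """
    if consecutive_failures <= 0:
        return "continue"
    if consecutive_failures == 1:
        return "automatic_retry"
    if consecutive_failures == 2:
        return "ask_for_clarification"
    return "offer_session_restart"


for failures in range(4):
    print(failures, error_action(failures))
```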

4. Interruption (Barge-In)
• The user needs a way to cancel an action
• Voice activity detection (VAD)
• Also include manual cancellation
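Barge-in and the state indicator can share one small state machine in the app. The sketch below is a hypothetical client-side model: the state names, event names, and transition table are assumptions for illustration and do not mirror any provider's event schema.

```python
from enum import Enum


class State(Enum):
    IDLE = "idle"
    LISTENING = "listening"
    THINKING = "thinking"
    SPEAKING = "speaking"


# (current state, event) -> next state
TRANSITIONS = {
    (State.IDLE, "speech_started"): State.LISTENING,
    (State.LISTENING, "speech_stopped"): State.THINKING,
    (State.THINKING, "response_started"): State.SPEAKING,
    (State.SPEAKING, "response_done"): State.IDLE,
    # Barge-in: VAD detects the user talking over the agent.
    (State.THINKING, "speech_started"): State.LISTENING,
    (State.SPEAKING, "speech_started"): State.LISTENING,
}


def next_state(state, event):
    """Return the next UI state; unknown events leave the state unchanged."""
    if event == "manual_cancel":  # explicit cancel control, works in every state
        return State.IDLE
    return TRANSITIONS.get((state, event), state)


state = State.SPEAKING
state = next_state(state, "speech_started")
print(state.value)
```

The current state can drive the visual indicator and audio cues, so the user always knows whether the agent is listening, thinking, or speaking.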

Realtime models do not only support speech
• Voice is not the only interface – it is an additional one
• Realtime models are also "multimodal models"
  • Speech
  • Text
  • Images
• Gemini Live also supports live camera input, so the model can see what you currently see

How much does it cost?

Pricing model of each model
• ~32 tokens/s (input)
• ~20-25 tokens/s (output)
• Cost increases with conversation length
• Each turn contains the existing context
Model | Input (Audio) | Cached Input | Output (Audio) | ~Cost per minute (conversation)
gpt-realtime | $32 / 1M tokens | $0.40 / 1M tokens | $64 / 1M tokens | ~$0.06–0.12
gpt-realtime-mini | $10 / 1M tokens | $0.30 / 1M tokens | $20 / 1M tokens | ~$0.02–0.04
Gemini Live 2.5 Flash | $3 / 1M tokens (audio) | Caching available | $12 / 1M tokens | ~$0.01–0.02
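The token rates and list prices can be combined into a rough per-minute estimate. This sketch assumes a minute split evenly between user and agent speech, uses the upper output rate of 25 tokens/s, and ignores both caching and the re-sent context that makes longer conversations more expensive; the function name and the 50/50 split are illustrative assumptions.

```python
# USD per 1M audio tokens as (input, output), taken from the pricing table.
PRICES = {
    "gpt-realtime": (32.0, 64.0),
    "gpt-realtime-mini": (10.0, 20.0),
    "Gemini Live 2.5 Flash": (3.0, 12.0),
}
INPUT_TOKENS_PER_S = 32   # ~32 tokens/s of user audio
OUTPUT_TOKENS_PER_S = 25  # upper end of the ~20-25 tokens/s output range


def audio_cost_usd(model, user_seconds, agent_seconds):
    """Estimate raw audio cost, ignoring caching and accumulated context."""
    price_in, price_out = PRICES[model]
    input_tokens = user_seconds * INPUT_TOKENS_PER_S
    output_tokens = agent_seconds * OUTPUT_TOKENS_PER_S
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


for model in PRICES:
    # Assumption: 30 s of user speech and 30 s of agent speech per minute.
    print(f"{model}: ${audio_cost_usd(model, 30, 30):.3f} per minute")
```

With these assumptions the estimates come out at about $0.079, $0.025, and $0.012 per minute, each inside the per-minute range listed for that model.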

How to keep the costs under control
1. Make use of prompt caching
2. Keep the agent from being too talkative
3. Configure truncation & retention ratio
4. Manage conversation history manually
5. Make use of the mini model
6. Proactive tool calling without confirmation
7. Design clear conversation flows with a defined end
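Managing conversation history manually could, for example, mean keeping only the newest turns that fit a token budget. The sketch below is app-side bookkeeping only: the `Turn` type, the token counts, and the budget are hypothetical, and how old items are actually removed from a live session depends on the provider's API.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    role: str
    text: str
    tokens: int  # assumed to be reported or estimated per turn


def trim_history(turns, max_tokens):
    """Keep the most recent turns whose combined tokens fit within max_tokens."""
    kept = []
    total = 0
    for turn in reversed(turns):  # walk from newest to oldest
        if total + turn.tokens > max_tokens:
            break
        kept.append(turn)
        total += turn.tokens
    kept.reverse()  # restore chronological order
    return kept


history = [
    Turn("user", "Start pocket depth measurement.", 100),
    Turn("assistant", "Understood.", 200),
    Turn("user", "Tooth one-eight: three.", 300),
]
print([turn.tokens for turn in trim_history(history, max_tokens=500)])
```

Because each turn re-sends the existing context, dropping old turns limits how fast input cost grows over a session.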

Example calculation
• Average duration of a pocket depth measurement: ~3 min
• With gpt-realtime-mini + caching: ~$0.04–0.08 per finding
• With Gemini Live Flash: ~$0.03–0.06 per finding
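These per-finding figures can be cross-checked against the per-minute ranges from the pricing table. The sketch simply multiplies each range by the session length; the function name is illustrative, and caching is not modelled.

```python
# Approximate USD per minute of conversation as (low, high), from the pricing table.
PER_MINUTE_USD = {
    "gpt-realtime-mini": (0.02, 0.04),
    "Gemini Live 2.5 Flash": (0.01, 0.02),
}


def session_cost_range(model, minutes):
    """Scale the per-minute range to a session, without any caching discount."""
    low, high = PER_MINUTE_USD[model]
    return round(low * minutes, 2), round(high * minutes, 2)


for model in PER_MINUTE_USD:
    low, high = session_cost_range(model, 3)
    print(f"{model}: ${low:.2f}-${high:.2f} per 3-minute finding")
```

For Gemini Live Flash this reproduces the $0.03–0.06 figure. For gpt-realtime-mini the uncached result is $0.06–0.12, so the $0.04–0.08 on the slide would be consistent with caching saving roughly a third.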

Integration & Best Practices

System Prompt

System Prompt: Best Practices
• Role & Objective: clear identity + task definition ("You are a dental assistant, your task is form-filling")
• Personality & Tone: language, formality, response length, and pacing set explicitly (e.g. "max 2 sentences", formal address, "professional & concise")
• Reference Pronunciations: phonetic guides for domain terms + tooth numbers that TTS would mispronounce ("one-eight" not "eighteen")
• Tools / Function Calling: per tool, define the trigger condition, preamble ("One moment..."), and error handling (1x silent retry, 2x notify user, 3x fall back to manual)
• Instructions / Rules: bullet points over prose, CAPS for critical rules ("NEVER GUESS", "ALWAYS use get_selected_tooth first")
• Conversation Flow: state machine for procedures (Pocket Depth: Init → Measurement Loop → Completion)
• Safety & Escalation: domain constraint, exact refusal scripts, escalation after 3x off-topic
• Examples: concrete input/output pairs for ambiguous cases
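A prompt that follows this structure can be assembled from labeled sections with short bullets. The sketch below is one possible layout: the "# heading" and "- bullet" markup, the `SECTIONS` contents, and the function name are assumptions for illustration, with the example wording borrowed from the list above.

```python
# Each labeled section covers one concern; bullets stay short.
SECTIONS = {
    "Role & Objective": [
        "You are a dental assistant.",
        "Your task is form-filling.",
    ],
    "Personality & Tone": [
        "Professional and concise.",
        "Max 2 sentences per response.",
    ],
    "Reference Pronunciations": [
        'Say tooth 18 as "one-eight", not "eighteen".',
    ],
    "Instructions / Rules": [
        "NEVER GUESS.",
        "ALWAYS use get_selected_tooth first.",
    ],
    "Conversation Flow": [
        "Pocket Depth: Init -> Measurement Loop -> Completion.",
    ],
}


def build_system_prompt(sections):
    """Render sections as headed bullet lists separated by blank lines."""
    blocks = []
    for heading, bullets in sections.items():
        lines = [f"# {heading}"] + [f"- {bullet}" for bullet in bullets]
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks)


print(build_system_prompt(SECTIONS))
```

Keeping the sections in a data structure makes it easy to version, test, and swap individual concerns without touching the rest of the prompt.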

System Prompt: Key Techniques
• Bullet Points > Prose: the model follows short bullets better than paragraphs
• Labeled Sections: the model navigates by heading; each section = one concern
• Lock the Language: "Respond ONLY in German" stated explicitly, prevents drift
• Unclear Audio Handling (3-Step): 1. Ask to repeat → 2. Address audio quality → 3. Offer to skip
• Rotating Confirmation Phrases: provide a list of alternatives ("Understood", "Noted", "Alright") → prevents robotic repetition
• Tool Preambles: a short sentence BEFORE the tool call so the user knows what's happening
• Confirm Critical Data: for medical values always read back: "Tooth one-eight: three. Correct?"
• Strict Domain Constraint: exact scripted response for out-of-scope requests + escalation threshold
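The read-back format and the rotating confirmations are easy to pin down in code, for instance to generate prompt examples or to test transcripts against the expected wording. This is an illustrative sketch: in a speech-to-speech setup the model itself speaks these phrases, and all function names here are hypothetical.

```python
from itertools import cycle

DIGIT_WORDS = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]


def spell_digits(number):
    """Spell a non-negative integer digit by digit, e.g. 18 -> "one-eight"."""
    return "-".join(DIGIT_WORDS[int(digit)] for digit in str(number))


def read_back(tooth, value):
    """Build the confirmation question for a critical value."""
    return f"Tooth {spell_digits(tooth)}: {spell_digits(value)}. Correct?"


def confirmation_phrases():
    """Return an endless rotation of confirmation phrases."""
    return cycle(["Understood", "Noted", "Alright"])


phrases = confirmation_phrases()
print(read_back(18, 3))
print(next(phrases), next(phrases))
```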

Security & Guardrails
• No API key in the client! → Generate an ephemeral token via the server
• Define clearly in the system prompt what is allowed and what is not
• Escalation tool: if there is uncertainty → abort or hand off to a human agent
• GDPR (DSGVO): check in which country the model is running

Summary
• Realtime voice ≠ chatbot with a microphone: it is a completely new interaction paradigm
• Speech-to-Speech is the way to go for low-latency, hands-free scenarios
• UX is the real challenge – clear and precise communication of state to the human user is key
• Cost is controllable – with caching, mini models, truncation, and short sessions
• Start small – one scenario, one tool, one demo
• And MOST IMPORTANT: the implementation needs to provide a REAL benefit for the user

Thank you!
Sascha Lehmann
