Chatless AI: Building Realtime Voice Interfaces for Web and Mobile

Chatless AI
Architecting Realtime, Hands-Free AI Interfaces for Web & Mobile

Sascha Lehmann, Consultant @ Thinktecture AG
@derLehmann_S
https://www.linkedin.com/in/sascha-lehmann
https://www.thinktecture.com/thinktects/sascha-lehmann/

What "real-time" really means: speech interaction until now
Mic → STT (Speech to Text) → LLM → TTS (Text to Speech) → Speaker
• Each arrow represents 200-500 ms of latency
• Like a Zoom call with a huge delay

What "real-time" really means: with a realtime model
Diagram: Mic, STT (Speech to Text), Realtime Model, TTS (Text to Speech), Speaker
• Permanently available audio channel
• It no longer feels like software, but like a real counterpart
• Average response time: 320 ms
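The difference between the two setups can be made concrete with a rough latency budget. The sketch below assumes the cascaded pipeline has four hops (Mic → STT → LLM → TTS → Speaker) and uses the 200-500 ms per hop and the 320 ms average quoted on the slides; the function and type names are illustrative only.

```typescript
// Rough latency budget for a cascaded voice pipeline.
// Assumption: four hops (Mic -> STT -> LLM -> TTS -> Speaker), 200-500 ms each.
interface LatencyRange {
  minMs: number;
  maxMs: number;
}

function cascadedLatency(hops: number, perHop: LatencyRange): LatencyRange {
  return { minMs: hops * perHop.minMs, maxMs: hops * perHop.maxMs };
}

const cascaded = cascadedLatency(4, { minMs: 200, maxMs: 500 });
const realtimeAverageMs = 320; // average response time quoted on the slide

console.log(`cascaded: ${cascaded.minMs}-${cascaded.maxMs} ms`); // cascaded: 800-2000 ms
console.log(`realtime: ~${realtimeAverageMs} ms`);
```

Even in the best case the cascaded pipeline spends more than twice the realtime model's average response time before the first sound reaches the speaker.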

Available Realtime Models

| Property | OpenAI gpt-realtime | OpenAI gpt-realtime-mini | Gemini Live 2.5 Flash |
| --- | --- | --- | --- |
| Architecture | Speech-to-speech, native | Speech-to-speech, native | Multimodal Live (audio/video/text) |
| Latency | ~320 ms (first response) | ~320 ms | Sub-800 ms |
| Languages | Multilingual, language change mid-sentence | Multilingual | 24 languages |
| Particular strengths | Best adherence to instructions, tool calling | Cost-efficient, quick | Video input, multimodal, cost-efficient |
| Barge-in support | | | |
| Emotional adaptation | Configurable via prompt | Restricted | Affective Dialog, native |
| Voices | 10+ voices | Like gpt-realtime | 30+ HD voices |
| Latest models | gpt-realtime-1.5 (Feb 2026) | gpt-realtime-mini-2025-12-15 | gemini-live-2.5-flash-native-audio (GA) |

Architecture: high level
• Model processes audio directly (lowest latency)
• Thinks and responds in speech
• Doesn't rely on a transcript
• Hears emotion and intent
• Filters noise
Source: https://developers.openai.com/api/docs/guides/voice-agents/

Architecture: transport, WebRTC vs. WebSocket

| Aspect | WebRTC | WebSocket |
| --- | --- | --- |
| Protocol | UDP (packet-loss tolerant) | TCP (guaranteed delivery) |
| Latency | Very low (~50–150 ms) | Higher (TCP head-of-line blocking) |
| Features | Echo cancellation, noise suppression, jitter buffer | - |
| Ideal for | Browser apps, client side | Server side, phone integration |
| Downsides | NAT traversal complexity | Latency peaks during packet loss |

Source: https://developers.openai.com/api/docs/guides/voice-agents/

<Demo/>
Animate the sample application

Realtime Connection
Diagram: App, Backend and Realtime API, connected in four numbered steps (1-4); labels: Realtime API key, System Prompt, Ephemeral Key, WebRTC
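One plausible reading of the diagram is that the backend holds the long-lived Realtime API key and the system prompt, exchanges them for an ephemeral key, and hands only that key to the app, which then opens the WebRTC connection. The sketch below builds the backend's HTTP request without sending it. The endpoint URL and the payload shape are assumptions based on OpenAI's Realtime documentation and should be checked against the current docs; `buildEphemeralKeyRequest` and `HttpRequest` are hypothetical names.

```typescript
// Backend-side sketch: describe the HTTP request that mints an ephemeral key.
// Assumptions (verify against the current OpenAI Realtime docs): the
// client_secrets endpoint URL and the shape of the session payload.
interface HttpRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

function buildEphemeralKeyRequest(
  apiKey: string,
  model: string,
  systemPrompt: string
): HttpRequest {
  return {
    url: "https://api.openai.com/v1/realtime/client_secrets",
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiKey}`, // long-lived key stays on the server
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      session: { type: "realtime", model, instructions: systemPrompt },
    }),
  };
}

// The backend would send this request and return only the short-lived key
// from the response to the app; the app never sees apiKey.
const keyRequest = buildEphemeralKeyRequest(
  "sk-server-side-only", // placeholder, not a real key
  "gpt-realtime",
  "You are a dentist assistant. Your task is form-filling."
);
console.log(keyRequest.url, JSON.parse(keyRequest.body).session.model);
```

Keeping the system prompt on the backend as well means the client cannot tamper with the agent's instructions before the session starts.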

Client-side connection

Sending events
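The slide's code did not survive extraction, so the following is only a minimal sketch of what sending events can look like. It assumes the OpenAI Realtime API event names `conversation.item.create` and `response.create`; the helper `userTextEvents` is hypothetical. In a WebRTC session the resulting strings would be sent over the data channel.

```typescript
// Client events as plain JSON objects. In a WebRTC session they would be
// sent as strings over the data channel.
// Assumption: event names and shapes follow the OpenAI Realtime API.
type ClientEvent =
  | {
      type: "conversation.item.create";
      item: {
        type: "message";
        role: "user";
        content: { type: "input_text"; text: string }[];
      };
    }
  | { type: "response.create" };

function userTextEvents(text: string): ClientEvent[] {
  return [
    {
      type: "conversation.item.create",
      item: {
        type: "message",
        role: "user",
        content: [{ type: "input_text", text }],
      },
    },
    { type: "response.create" }, // ask the model to answer the new item
  ];
}

const wire = userTextEvents("Tooth one-eight: three.").map((e) =>
  JSON.stringify(e)
);
console.log(wire);
```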

UX Challenges

1. What is my current state?
• The user has no visual clue
• Voice needs a state indicator
• Other methods
  • Audio cues
  • Sound effects
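A state indicator needs a single source of truth for the current state. The sketch below derives it from server events; the event names assume the OpenAI Realtime API, while the four state names and the `nextState` function are hypothetical. Note that `response.created` and `response.done` only approximate when audio actually starts and stops playing.

```typescript
// Map realtime server events to a UI state that can drive a visual
// indicator or an audio cue.
type VoiceState = "idle" | "listening" | "thinking" | "speaking";

function nextState(current: VoiceState, eventType: string): VoiceState {
  switch (eventType) {
    case "input_audio_buffer.speech_started":
      return "listening"; // user started talking (also covers barge-in)
    case "input_audio_buffer.speech_stopped":
      return "thinking"; // waiting for the model
    case "response.created":
      return "speaking";
    case "response.done":
      // A cancelled response may finish while the user is already talking.
      return current === "speaking" ? "idle" : current;
    default:
      return current; // ignore events that do not affect the indicator
  }
}

const incoming = [
  "input_audio_buffer.speech_started",
  "input_audio_buffer.speech_stopped",
  "response.created",
  "response.done",
];
let state: VoiceState = "idle";
const trace: VoiceState[] = [];
for (const eventType of incoming) {
  state = nextState(state, eventType);
  trace.push(state);
}
console.log(trace.join(" -> ")); // listening -> thinking -> speaking -> idle
```

Each transition is also a natural place to play a short audio cue for users who are not looking at the screen.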

2. Success / failure feedback
• Unique audio cue sound
• Vocal feedback from the agent
• Visual indicator update

3. Error handling
• Automatic retry
• Agent asks for clarification
• Control to restart the voice agent session

4. Interruption (barge-in)
• User needs the possibility to cancel an action
• Voice activity detection (VAD)
• Also include manual cancellation
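Manual cancellation can be wired to a button or gesture that sends the same signals a spoken interruption would trigger. The sketch below assumes the OpenAI Realtime API client events `response.cancel` and `conversation.item.truncate`; the helper name and its parameters are hypothetical. Truncation is mainly relevant when the client manages audio playback itself.

```typescript
// Events a "stop" button could send to interrupt the agent.
// Assumption: event names and fields follow the OpenAI Realtime API.
type CancelEvent =
  | { type: "response.cancel" }
  | {
      type: "conversation.item.truncate";
      item_id: string;
      content_index: number;
      audio_end_ms: number;
    };

function manualCancelEvents(
  assistantItemId: string,
  playedMs: number
): CancelEvent[] {
  return [
    { type: "response.cancel" }, // stop generating
    {
      // Tell the model how much of its answer the user actually heard.
      type: "conversation.item.truncate",
      item_id: assistantItemId,
      content_index: 0,
      audio_end_ms: Math.max(0, Math.floor(playedMs)),
    },
  ];
}

console.log(manualCancelEvents("item_123", 1480.6));
```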

Realtime models do not only support speech
• Voice is not the only interface; it is an additional one
• Realtime models are also "multimodal models"
  • Speech
  • Text
  • Images
• Gemini Live also supports live camera input, so the model can see what you currently see

How much does it cost?

Pricing model of each model
• ~32 tokens/s (input)
• ~20-25 tokens/s (output)
• Cost increases with conversation length
• Each turn contains the existing context

| Model | Input (audio) | Cached input | Output (audio) | ~Cost per minute (conversation) |
| --- | --- | --- | --- | --- |
| gpt-realtime | $32 / 1M tokens | $0.40 / 1M tokens | $64 / 1M tokens | ~$0.06–0.12 |
| gpt-realtime-mini | $10 / 1M tokens | $0.30 / 1M tokens | $20 / 1M tokens | ~$0.02–0.04 |
| Gemini Live 2.5 Flash | $3 / 1M tokens (audio) | Caching available | $12 / 1M tokens | ~$0.01–0.02 |
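Why cost grows with conversation length can be shown with a simplified per-turn estimate. The sketch below uses the token rates and the gpt-realtime prices from the slide. It assumes that everything said so far is re-sent as audio input on each turn, that the context is billed either fully at the cached rate or fully at the normal input rate, and that the agent produces 22 tokens/s. It is not meant to reproduce the per-minute figures in the table.

```typescript
// Per-turn cost estimate; all prices are USD per 1M audio tokens.
interface Pricing {
  inputPerM: number;
  cachedPerM: number;
  outputPerM: number;
}

interface Turn {
  userSec: number; // seconds the user speaks
  agentSec: number; // seconds the agent speaks
}

const INPUT_TOKENS_PER_SEC = 32; // from the slide
const OUTPUT_TOKENS_PER_SEC = 22; // assumption: within the slide's 20-25 range

function turnCosts(p: Pricing, turns: Turn[], contextCached: boolean): number[] {
  let contextTokens = 0; // everything said so far is re-sent as input
  return turns.map((t) => {
    const newInput = t.userSec * INPUT_TOKENS_PER_SEC;
    const output = t.agentSec * OUTPUT_TOKENS_PER_SEC;
    const contextRate = contextCached ? p.cachedPerM : p.inputPerM;
    const cost =
      (contextTokens * contextRate +
        newInput * p.inputPerM +
        output * p.outputPerM) /
      1e6;
    contextTokens += newInput + output;
    return cost;
  });
}

const gptRealtime: Pricing = { inputPerM: 32, cachedPerM: 0.4, outputPerM: 64 };
const sixTurns: Turn[] = Array.from({ length: 6 }, () => ({
  userSec: 5,
  agentSec: 5,
}));

const cachedCosts = turnCosts(gptRealtime, sixTurns, true);
const uncachedCosts = turnCosts(gptRealtime, sixTurns, false);
console.log("cached:  ", cachedCosts.map((c) => c.toFixed(4)));
console.log("uncached:", uncachedCosts.map((c) => c.toFixed(4)));
```

With caching, the later turns cost only slightly more than the first one; without it, the re-sent context quickly dominates the bill.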

How to keep the costs under control
1. Make use of prompt caching
2. Don't make the agent too talkative
3. Configure truncation & retention ratio
4. Manage conversation history manually
5. Make use of the mini model
6. Proactive tool calling without confirmation
7. Design clear conversation flows with a definite end
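Point 4 can be sketched as a token budget over the conversation items. The helper below is hypothetical and assumes the client tracks an approximate token count per item; it returns the IDs of the oldest items that no longer fit, which the client could then ask the API to delete (in the OpenAI Realtime API this would presumably be a `conversation.item.delete` event per ID).

```typescript
interface HistoryItem {
  id: string;
  tokens: number; // approximate size, tracked by the client
}

// Keep the newest items that fit into maxTokens and return the IDs of
// everything older. Items are expected oldest-first.
function itemsToDelete(items: HistoryItem[], maxTokens: number): string[] {
  let keptTokens = 0;
  let keptCount = 0;
  for (const item of [...items].reverse()) {
    if (keptTokens + item.tokens > maxTokens) break; // older items go too
    keptTokens += item.tokens;
    keptCount += 1;
  }
  return items.slice(0, items.length - keptCount).map((item) => item.id);
}

const conversation: HistoryItem[] = [
  { id: "item_a", tokens: 300 },
  { id: "item_b", tokens: 200 },
  { id: "item_c", tokens: 100 },
];
console.log(itemsToDelete(conversation, 300)); // [ 'item_a' ]
```

Cutting at the first item that does not fit, rather than skipping it, avoids leaving gaps in the middle of the conversation.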

Example calculation
• Average duration of a pocket depth measurement: ~3 min
• With gpt-realtime-mini + caching: ~$0.04–0.08 per finding
• With Gemini Live Flash: ~$0.03–0.06 per finding
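The per-finding figures follow from the per-minute estimates in the pricing table. The sketch below multiplies those ranges by the 3-minute duration; `perFinding` is an illustrative name. The Gemini result matches the slide directly, while the gpt-realtime-mini result is the figure before caching, so the slide's lower $0.04–0.08 presumably reflects caching savings whose exact size is not stated.

```typescript
// Cost range per finding, rounded to whole cents.
function perFinding(
  minutes: number,
  perMinuteLow: number,
  perMinuteHigh: number
): [number, number] {
  const toCents = (usd: number) => Math.round(usd * 100) / 100;
  return [toCents(minutes * perMinuteLow), toCents(minutes * perMinuteHigh)];
}

const geminiPerFinding = perFinding(3, 0.01, 0.02); // Gemini Live 2.5 Flash
const miniPerFinding = perFinding(3, 0.02, 0.04); // gpt-realtime-mini, no caching

console.log(geminiPerFinding); // [ 0.03, 0.06 ]
console.log(miniPerFinding); // [ 0.06, 0.12 ]
```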

Integration & Best Practices

System Prompt

System Prompt: best practices
• Role & Objective: clear identity + task definition ("You are a dentist assistant, your task is form-filling")
• Personality & Tone: language, formality, response length and pacing set explicitly (e.g. "max 2 sentences", formal address, "professional & concise")
• Reference Pronunciations: phonetic guides for domain terms + tooth numbers that TTS would mispronounce ("one-eight" not "eighteen")
• Tools / Function Calling: per tool, trigger condition, preamble ("One moment..."), error handling (1x silent retry, 2x notify user, 3x fall back to manual)
• Instructions / Rules: bullet points over prose, CAPS for critical rules ("NEVER GUESS", "ALWAYS use get_selected_tooth first")
• Conversation Flow: state machine for procedures (Pocket Depth: Init → Measurement Loop → Completion)
• Safety & Escalation: domain constraint, exact refusal scripts, escalation after 3x off-topic
• Examples: concrete input/output pairs for ambiguous cases
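The tool error-handling ladder can also be enforced in application code rather than left to the prompt alone. The sketch below reads "1x silent retry, 2x notify user, 3x fall back to manual" as the action to take on the first, second and third consecutive failure of a tool call; that reading, and all names in the block, are assumptions.

```typescript
type FailureAction = "silent_retry" | "notify_user" | "manual_fallback";

// Decide what to do after the n-th consecutive failure of a tool call.
function actionForFailure(failureCount: number): FailureAction {
  if (failureCount <= 1) return "silent_retry";
  if (failureCount === 2) return "notify_user";
  return "manual_fallback";
}

for (const n of [1, 2, 3]) {
  console.log(`failure ${n}: ${actionForFailure(n)}`);
}
```

Resetting the counter after a successful call keeps one flaky tool from pushing the whole session into manual mode.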

System Prompt: key techniques
• Bullet Points > Prose: the model follows short bullets better than paragraphs
• Labeled Sections: the model navigates by heading; each section = one concern
• Lock the Language: "Respond ONLY in German" stated explicitly, prevents drift
• Unclear Audio Handling (3-step): 1. Ask to repeat → 2. Address audio quality → 3. Offer to skip
• Rotating Confirmation Phrases: provide a list of alternatives ("Understood", "Noted", "Alright") → prevents robotic repetition
• Tool Preambles: short sentence BEFORE the tool call so the user knows what's happening
• Confirm Critical Data: for medical values always read back: "Tooth one-eight: three. Correct?"
• Strict Domain Constraint: exact scripted response for out-of-scope requests + escalation threshold
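Labeled sections and short bullets are easy to keep consistent when the prompt is assembled from structured data. The sketch below is one possible way to do that. The markdown-style `#` headings and `-` bullets are an assumption about formatting, not something the slides prescribe, and the section contents are taken from the examples above.

```typescript
interface PromptSection {
  heading: string; // one concern per section
  bullets: string[]; // short rules instead of prose
}

function buildSystemPrompt(sections: PromptSection[]): string {
  return sections
    .map((s) => [`# ${s.heading}`, ...s.bullets.map((b) => `- ${b}`)].join("\n"))
    .join("\n\n");
}

const systemPrompt = buildSystemPrompt([
  {
    heading: "Role & Objective",
    bullets: ["You are a dentist assistant.", "Your task is form-filling."],
  },
  { heading: "Language", bullets: ["Respond ONLY in German."] },
  {
    heading: "Unclear Audio",
    bullets: [
      "1st time: ask to repeat.",
      "2nd time: address audio quality.",
      "3rd time: offer to skip.",
    ],
  },
  {
    heading: "Confirmations",
    bullets: ['Rotate between: "Understood", "Noted", "Alright".'],
  },
]);
console.log(systemPrompt);
```

Keeping sections as data also makes it possible to change exactly one section between evaluation runs.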

Security & Guardrails
• No API key in the client! → Generate an ephemeral token via the server
• Clear definitions of what is allowed and what is not, in the system prompt
• Escalation tool: if there is uncertainty → abort or hand off to a human agent
• GDPR (DSGVO): check in which country the model is running

Evaluation
How do I know a prompt change actually made things better?

Evaluation: why it is hard
• Two axes, not one: a response can be right and still sound broken. Grade both separately.
• Content quality: did it do the right thing? Correctness, tool choice, tool arguments, instruction following.
• Audio quality: did it sound acceptable? Naturalness, pacing, pronunciation, behaviour under noise.
• A turn is a pipeline: speech start/stop, commit, response, audio deltas. Log every stage to find the real root cause.
• Transcript is not ground truth: the audio signal is the truth; a transcript is just a model's interpretation and can be wrong.
• False fails & false passes: ASR drops a digit the model heard correctly, or a clean transcript hides clipped audio the model guessed at.
• Grade on transcripts + traces: run most automated grading on transcripts at scale; calibrate graders on noisy, production-like text.
• Add an audio audit loop: spot-check ~1–5% of sessions end-to-end by actually listening.
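Logging every stage of a turn pays off once the timestamps are turned into per-stage durations. The sketch below uses hypothetical stage labels that mirror the pipeline named above; in practice they would be mapped from the actual server events.

```typescript
interface StageEvent {
  stage: string; // e.g. "speech_stop", "commit", "response", "first_audio_delta"
  atMs: number; // timestamp in milliseconds
}

// Time spent between consecutive logged stages of one turn.
function stageDurations(log: StageEvent[]): Record<string, number> {
  const sorted = [...log].sort((a, b) => a.atMs - b.atMs);
  const durations: Record<string, number> = {};
  let previous: StageEvent | undefined;
  for (const current of sorted) {
    if (previous) {
      durations[`${previous.stage} -> ${current.stage}`] =
        current.atMs - previous.atMs;
    }
    previous = current;
  }
  return durations;
}

const turnLog: StageEvent[] = [
  { stage: "speech_stop", atMs: 1000 },
  { stage: "commit", atMs: 1020 },
  { stage: "response", atMs: 1100 },
  { stage: "first_audio_delta", atMs: 1340 },
];
console.log(stageDurations(turnLog));
```

A slow turn then points at one specific hop instead of "the model was slow".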

Evaluation: Crawl / Walk / Run
• Build complexity in steps: if your system cannot crawl, it will not run. Early evals must be diagnosable, repeatable and cheap.
• Crawl: synthetic (TTS) audio + single-turn. Tests intelligence: intent routing, tool choice, valid arguments.
• Walk: real noisy audio + single-turn. Tests perception: does it still hear "7pm", not "7" or "7:15"?
• Run: simulated user + multi-turn. Tests robustness: holds the goal, sequences tools, recovers from errors.
• Two independent axes: right = more realistic audio, up = more realistic interaction. Start bottom-left, raise one axis at a time.
• Example: "Change my reservation to 7pm". Crawl grades the next turn only; Walk replays it under phone-bandwidth noise; Run adds messy follow-ups + an injected tool error.
• Single-turn vs. multi-turn: single-turn = can you win the battle. Multi-turn = can you win the war (episode-level outcome).
• Top-right is manual: real audio + full multi-turn = run end-to-end sessions yourself, for the whole project.

Evaluation: the loop. Does a change actually help?
• Dataset: start with a gold seed set (10–50) of must-not-fail flows. Balance positives and negatives; tag by intent, audio, language, expected tool.
• Graders: layer them: deterministic (tool calls, JSON, patterns), LLM rubric (correctness, helpfulness), audio (silence, overlap, interruptions).
• Harness: one job: make runs comparable. Pin audio bytes, chunking and VAD; prefer VAD off + manual commit for reproducibility.
• Manual review is highest-leverage: automation shows what you can measure; listening shows what you should measure.
• The iteration loop: run evals, localise the failure to one behaviour, change ONE thing, re-run, confirm the fix improved it without regressions.
• Regression suite: hard cases you already fixed. Run on every prompt, model or tool change. Your "do not break" contract.
• Rolling discovery set: fresh failures from production. Promote real failure modes into the offline dataset over time.
• Holdout set: untouched subset run occasionally. If test scores climb while holdout stays flat, you are training for the test.
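The deterministic grader layer and the "change ONE thing, re-run" comparison can be sketched in a few lines. The example below grades expected tool calls and compares the pass rate of two runs. The tool names, the argument format and the recorded results are made up for illustration, following the reservation example from the Crawl / Walk / Run slide.

```typescript
interface ToolCall {
  name: string;
  args: Record<string, unknown>;
}

interface EvalCase {
  id: string;
  expected: ToolCall;
}

// Deterministic grader: right tool, and every expected argument matches.
function gradeToolCall(expected: ToolCall, actual: ToolCall | undefined): boolean {
  if (!actual || actual.name !== expected.name) return false;
  return Object.entries(expected.args).every(
    ([key, value]) => JSON.stringify(actual.args[key]) === JSON.stringify(value)
  );
}

function passRate(
  cases: EvalCase[],
  run: Record<string, ToolCall | undefined>
): number {
  if (cases.length === 0) return 0;
  const passed = cases.filter((c) => gradeToolCall(c.expected, run[c.id])).length;
  return passed / cases.length;
}

// Hypothetical gold cases and two recorded runs (before/after a prompt change).
const cases: EvalCase[] = [
  { id: "change-to-7pm", expected: { name: "update_reservation", args: { time: "19:00" } } },
  { id: "cancel", expected: { name: "cancel_reservation", args: {} } },
];
const baseline = {
  "change-to-7pm": { name: "update_reservation", args: { time: "07:00" } },
  cancel: { name: "cancel_reservation", args: {} },
};
const candidate = {
  "change-to-7pm": { name: "update_reservation", args: { time: "19:00" } },
  cancel: { name: "cancel_reservation", args: {} },
};

console.log("baseline:", passRate(cases, baseline)); // baseline: 0.5
console.log("candidate:", passRate(cases, candidate)); // candidate: 1
```

Running the same cases as a regression suite on every prompt, model or tool change turns the comparison into the "do not break" contract described above.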

Summary
• Realtime voice ≠ chatbot with a microphone: it is a completely new interaction paradigm
• Speech-to-speech is the way to go for low-latency, hands-free scenarios
• UX is the real challenge: clear and precise communication of state to the human user is key
• Cost is controllable with caching, mini models, truncation and short sessions
• Start small: one scenario, one tool, one demo
• And MOST IMPORTANT: the implementation needs to provide a REAL benefit for the user

Thank you!
Sascha Lehmann
