

Sascha Lehmann
June 30, 2026

Chatless AI: Building Realtime Voice Interfaces for Web and Mobile

Realtime speech models such as GPT-realtime or Gemini-Live are giving rise to a new generation of interfaces: speech becomes an instantly responsive, low-latency interaction channel, with no prompting, no waiting, and hands-free operation.
In this talk, Sascha Lehmann shows how realtime models work technically, how to reliably control context limits, roles and security, and how to integrate realtime AI into web and mobile applications in a targeted way, from architecture and cost optimization to a UX that guides users transparently through the dialog.


Transcript

  1. Chatless AI: Architecting Realtime, Hands-Free AI Interfaces for Web & Mobile

    Sascha Lehmann (@derLehmann_S), Consultant
  2. Sascha Lehmann (@derLehmann_S), Consultant @ Thinktecture AG

    https://www.linkedin.com/in/sascha-lehmann
    https://www.thinktecture.com/thinktects/sascha-lehmann/
  3. What "real-time" really means: speech interaction until now

    Mic → STT (Speech to Text) → LLM → TTS (Text to Speech) → Speaker
    • Each arrow represents 200–500 ms of latency
    • Like a Zoom call with a huge delay
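
To make the latency claim concrete, here is a minimal arithmetic sketch. It assumes the cascaded pipeline has the five stages named on the slide and that every hop between two stages costs the stated 200–500 ms; the function name is made up for illustration.

```python
# Stage names are taken from the slide; the helper itself is hypothetical.
STAGES = ["Mic", "STT", "LLM", "TTS", "Speaker"]
HOP_LATENCY_MS = (200, 500)  # per-arrow range stated on the slide


def pipeline_latency_ms(stages, hop_range):
    """Return the (best, worst) end-to-end latency of a cascaded pipeline."""
    hops = len(stages) - 1  # one arrow between each pair of neighbouring stages
    low, high = hop_range
    return hops * low, hops * high


low, high = pipeline_latency_ms(STAGES, HOP_LATENCY_MS)
print(f"{len(STAGES) - 1} hops: {low}-{high} ms end to end")
# prints "4 hops: 800-2000 ms end to end"
```

Under these assumptions the classic pipeline needs roughly 0.8 to 2 seconds before the user hears anything, which is the delay the Zoom comparison refers to.
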
  4. What "real-time" really means: with a realtime model

    Diagram labels: Mic, STT (Speech to Text), Realtime Model, TTS (Text to Speech), Speaker
    • Permanently available audio channel
    • It no longer feels like software, but like a real counterpart
    • Average response time: 320 ms
  5. Available Realtime Models

    | Property | OpenAI gpt-realtime | OpenAI gpt-realtime-mini | Gemini Live 2.5 Flash |
    | Architecture | Speech-to-Speech native | Speech-to-Speech native | Multimodal Live (Audio/Video/Text) |
    | Latency | ~320 ms (first response) | ~320 ms | Sub-800 ms |
    | Languages | Multilingual, language change mid-sentence | Multilingual | 24 languages |
    | Particular strengths | Best adherence to instructions, Tool Calling | Cost-efficient, quick | Video input, multimodal, cost-efficient |
    | Barge-In support | | | |
    | Emotional adaptation | Configurable via prompt | Restricted | Affective Dialog native |
    | Voices | 10+ voices | Like gpt-realtime | 30+ HD voices |
    | Latest models | gpt-realtime-1.5 (Feb 2026) | gpt-realtime-mini-2025-12-15 | gemini-live-2.5-flash-native-audio (GA) |
  6. Architecture: High Level

    • Model processes audio directly (lowest latency)
    • Thinks and responds in speech
    • Doesn't rely on a transcript
    • Hears emotion and intent
    • Filters noise
    https://developers.openai.com/api/docs/guides/voice-agents/
  7. Architecture: Transport – WebRTC vs. WebSocket

    | Aspect | WebRTC | WebSocket |
    | Protocol | UDP (packet loss tolerant) | TCP (guaranteed delivery) |
    | Latency | Very low (~50–150 ms) | Higher (TCP head-of-line blocking) |
    | Features | Echo cancellation, noise suppression, jitter buffer | - |
    | Ideal for | Browser apps, client-side | Server-side, phone integration |
    | Downsides | NAT traversal complexity | Latency peaks during packet loss |
    https://developers.openai.com/api/docs/guides/voice-agents/
  8. Realtime Connection

    Diagram: App, Backend and RealtimeAPI, connected in four numbered steps (1–4). Labels: RealtimeAPI-Key, System Prompt, Ephemeral Key, WebRTC
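
The diagram labels suggest a flow in which the backend keeps the RealtimeAPI-Key and the app only ever receives an Ephemeral Key. The sketch below simulates that idea with an in-memory stand-in. `FakeRealtimeProvider`, `Backend`, `create_ephemeral_key`, `connect` and the 60-second lifetime are all hypothetical names and values chosen for illustration; this is not the API of any real SDK.

```python
import secrets
import time
from dataclasses import dataclass


@dataclass
class EphemeralKey:
    value: str
    expires_at: float


class FakeRealtimeProvider:
    """In-memory stand-in for a realtime API (hypothetical, not a real SDK)."""

    def __init__(self, valid_api_key):
        self._valid_api_key = valid_api_key
        self._sessions = {}  # ephemeral key -> (system prompt, expiry)

    def create_ephemeral_key(self, api_key, system_prompt, ttl_s=60.0, now=None):
        if api_key != self._valid_api_key:
            raise PermissionError("invalid API key")
        now = time.time() if now is None else now
        key = EphemeralKey("ek_" + secrets.token_urlsafe(16), now + ttl_s)
        self._sessions[key.value] = (system_prompt, key.expires_at)
        return key

    def connect(self, ephemeral_key, now=None):
        """The client connects with the short-lived key only."""
        now = time.time() if now is None else now
        session = self._sessions.get(ephemeral_key)
        return session is not None and now < session[1]


class Backend:
    """Holds the long-lived API key and the system prompt; the app sees neither."""

    def __init__(self, provider, api_key, system_prompt):
        self._provider = provider
        self._api_key = api_key
        self._system_prompt = system_prompt

    def handle_token_request(self, now=None):
        return self._provider.create_ephemeral_key(
            self._api_key, self._system_prompt, now=now
        )


provider = FakeRealtimeProvider("sk-secret")
backend = Backend(provider, "sk-secret", "You are a form-filling assistant.")
key = backend.handle_token_request(now=0.0)
print(provider.connect(key.value, now=10.0))  # within the lifetime
print(provider.connect(key.value, now=61.0))  # after expiry
```

The point of the design is that a leaked client token expires quickly and cannot be used to change the system prompt, while the long-lived key never leaves the server.
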
  9. 1. What is my current state?

    • The user has no visual clue
    • Voice needs a state indicator
    • Other methods
      • Audio cues
      • Sound effects
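
One way to keep the state indicator honest is to drive it from a small state machine, so that every state change produces both a visual label and an audio cue. The states, the allowed transitions and the cue file names below are assumptions made for this sketch; the slide only says that a state indicator and audio cues are needed.

```python
from enum import Enum


class VoiceState(Enum):
    IDLE = "idle"
    LISTENING = "listening"
    THINKING = "thinking"
    SPEAKING = "speaking"
    ERROR = "error"


# Hypothetical transition table; adapt it to the actual session lifecycle.
ALLOWED = {
    VoiceState.IDLE: {VoiceState.LISTENING},
    VoiceState.LISTENING: {VoiceState.THINKING, VoiceState.IDLE, VoiceState.ERROR},
    VoiceState.THINKING: {VoiceState.SPEAKING, VoiceState.ERROR},
    VoiceState.SPEAKING: {VoiceState.LISTENING, VoiceState.IDLE, VoiceState.ERROR},
    VoiceState.ERROR: {VoiceState.IDLE},
}


class StateIndicator:
    """Tracks the agent state and records what the UI should show and play."""

    def __init__(self):
        self.state = VoiceState.IDLE
        self.events = []  # (visual label, audio cue file) pairs

    def transition(self, new_state):
        if new_state not in ALLOWED[self.state]:
            raise ValueError(f"{self.state.value} -> {new_state.value} not allowed")
        self.state = new_state
        self.events.append((new_state.value, f"cue_{new_state.value}.wav"))


indicator = StateIndicator()
for step in (VoiceState.LISTENING, VoiceState.THINKING, VoiceState.SPEAKING):
    indicator.transition(step)
print(indicator.events)
```

Rejecting undefined transitions makes it harder for the UI to show "listening" while the agent is actually still speaking.
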
  10. 2. Success / Failure feedback

    • Unique audio cue sound
    • Vocal feedback from the agent
    • Visual indicator update
  11. 3. Error Handling

    • Automatic retry
    • Agent asks for clarification
    • Control to restart the voice agent session
  12. 4. Interruption (Barge-In)

    • The user must be able to cancel an action
    • Voice activity detection (VAD)
    • Also include manual cancellation
  13. Realtime Models do not only support speech

    • Voice is not the only interface – it is an additional one
    • Realtime models are also "Multimodal Models"
      • Speech
      • Text
      • Images
    • Gemini Live also supports live camera input, so the model can see what you currently see
  14. Pricing model of each model

    • ~32 tokens/s (input)
    • ~20–25 tokens/s (output)
    • Cost increases with conversation length
    • Each turn contains the existing context

    | Model | Input (Audio) | Cached Input | Output (Audio) | ~Cost per minute (conversation) |
    | gpt-realtime | $32 / 1M tokens | $0.40 / 1M tokens | $64 / 1M tokens | ~$0.06–0.12 |
    | gpt-realtime-mini | $10 / 1M tokens | $0.30 / 1M tokens | $20 / 1M tokens | ~$0.02–0.04 |
    | Gemini Live 2.5 Flash | $3 / 1M tokens (Audio) | Caching available | $12 / 1M tokens | ~$0.01–0.02 |
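
The following sketch shows why cost grows with conversation length, using the prices and token rates from this slide. It is a simplified model, not a billing formula: it assumes the whole previous conversation is re-sent as input on every turn, that with caching all of that context is billed at the cached rate, and that output runs at 22 tokens/s (a midpoint of the stated ~20–25). `conversation_cost` is a name made up for this example.

```python
# USD per 1M tokens, copied from the slide.
PRICES = {
    "gpt-realtime": {"input": 32.0, "cached": 0.40, "output": 64.0},
    "gpt-realtime-mini": {"input": 10.0, "cached": 0.30, "output": 20.0},
}
INPUT_TOK_PER_S = 32   # ~32 tokens/s (input), from the slide
OUTPUT_TOK_PER_S = 22  # assumed midpoint of ~20-25 tokens/s (output)


def conversation_cost(model, turns, user_s, agent_s, use_cache=True):
    """Estimate the cost in USD of `turns` turns with fixed speaking times."""
    price = PRICES[model]
    context_tokens = 0  # everything said so far, re-sent on each turn
    total = 0.0
    for _ in range(turns):
        new_input = user_s * INPUT_TOK_PER_S
        output = agent_s * OUTPUT_TOK_PER_S
        context_rate = price["cached"] if use_cache else price["input"]
        total += (
            context_tokens * context_rate
            + new_input * price["input"]
            + output * price["output"]
        ) / 1_000_000
        context_tokens += new_input + output
    return total


one_minute = conversation_cost("gpt-realtime", turns=1, user_s=30, agent_s=30)
print(f"one 60 s turn: ${one_minute:.3f}")
for cache in (True, False):
    cost = conversation_cost("gpt-realtime", 10, 10, 10, use_cache=cache)
    print(f"10 turns, cache={cache}: ${cost:.3f}")
```

A single turn with 30 s of user speech and 30 s of agent speech comes to about $0.073 in this model, which falls inside the ~$0.06–0.12 per-minute range quoted for gpt-realtime. Without caching, the re-sent context makes later turns progressively more expensive.
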
  15. How to keep the costs under control

    1. Make use of prompt caching
    2. Keep the agent from being too talkative
    3. Configure truncation & retention ratio
    4. Manage the conversation history manually
    5. Make use of the mini model
    6. Proactive tool calling without confirmation
    7. Design clear conversation flows with a definite end
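
Points 3 and 4 can be illustrated with a small client-side sketch of history truncation. It assumes a retention ratio means: once the history exceeds the token limit, drop the oldest turns until only that fraction of the limit remains. The function name, the data shape and this exact interpretation are assumptions for illustration; the behaviour of a concrete API setting may differ.

```python
def truncate_history(turns, max_tokens, retention_ratio):
    """Trim a conversation history given as (text, token_count) pairs, oldest first.

    Nothing happens while the history fits into max_tokens. Once it overflows,
    the oldest turns are dropped until at most max_tokens * retention_ratio remain.
    """
    total = sum(tokens for _, tokens in turns)
    if total <= max_tokens:
        return list(turns)
    target = int(max_tokens * retention_ratio)
    kept = list(turns)
    while kept and total > target:
        _, tokens = kept.pop(0)  # drop the oldest turn first
        total -= tokens
    return kept


history = [("turn 1", 400), ("turn 2", 400), ("turn 3", 400)]
print(truncate_history(history, max_tokens=1000, retention_ratio=0.5))
# prints [('turn 3', 400)]
```

Cutting deeper than strictly necessary means truncation happens less often, so the start of the context stays unchanged for more turns, which tends to help prompt caching.
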
  16. Example calculation

    • Average duration of a pocket depth measurement: ~3 min
    • With gpt-realtime-mini + caching: ~$0.04–0.08 per finding
    • With Gemini Live Flash: ~$0.03–0.06 per finding
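
As a plausibility check, the per-finding figures can be related to the per-minute costs from the pricing slide (~$0.01–0.02 for Gemini Live 2.5 Flash, ~$0.02–0.04 for gpt-realtime-mini). The helper below is a hypothetical one-liner that simply multiplies a per-minute range by the session length.

```python
def cost_range(per_minute_low, per_minute_high, minutes):
    """Scale a per-minute cost range (USD) to a session of the given length."""
    return round(per_minute_low * minutes, 2), round(per_minute_high * minutes, 2)


MEASUREMENT_MINUTES = 3  # average pocket depth measurement, from the slide

print(cost_range(0.01, 0.02, MEASUREMENT_MINUTES))  # Gemini Live 2.5 Flash
# prints (0.03, 0.06)
print(cost_range(0.02, 0.04, MEASUREMENT_MINUTES))  # gpt-realtime-mini, no caching
# prints (0.06, 0.12)
```

The Gemini figure matches the slide directly. For gpt-realtime-mini, plain multiplication gives $0.06–0.12, so the slide's ~$0.04–0.08 with caching corresponds to roughly a third less.
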
  17. System Prompt: Best Practices

    • Role & Objective: Clear identity + task definition ("You are a dentist assistant, your task is form-filling")
    • Personality & Tone: Language, formality, response length and pacing set explicitly (e.g. "max 2 sentences", formal address, "professional & concise")
    • Reference Pronunciations: Phonetic guides for domain terms + tooth numbers that TTS would mispronounce ("one-eight" not "eighteen")
    • Tools / Function Calling: Per tool: trigger condition, preamble ("One moment..."), error handling (1x silent retry, 2x notify user, 3x fall back to manual)
    • Instructions / Rules: Bullet points over prose, CAPS for critical rules ("NEVER GUESS", "ALWAYS use get_selected_tooth first")
    • Conversation Flow: State machine for procedures (Pocket Depth: Init → Measurement Loop → Completion)
    • Safety & Escalation: Domain constraint, exact refusal scripts, escalation after 3x off-topic
    • Examples: Concrete input/output pairs for ambiguous cases
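
The tool error handling rule (1x silent retry, 2x notify user, 3x fall back to manual) can also be enforced in application code around the tool call. The sketch below is one possible reading of that rule; `call_with_escalation` and its callback parameters are hypothetical names, and the notification text is invented.

```python
def call_with_escalation(tool, notify_user, fall_back_to_manual):
    """Run a tool call with a three-step failure ladder.

    1st failure: retry silently.
    2nd failure: tell the user, then retry once more.
    3rd failure: give up and hand over to manual input.
    """
    failures = 0
    while True:
        try:
            return tool()
        except Exception:
            failures += 1
            if failures == 1:
                continue
            if failures == 2:
                notify_user("That did not work, I am trying again.")
                continue
            return fall_back_to_manual()


def make_flaky_tool(fail_times):
    """Test helper: a tool that fails the first `fail_times` calls."""
    calls = {"count": 0}

    def tool():
        calls["count"] += 1
        if calls["count"] <= fail_times:
            raise RuntimeError("tool failed")
        return "saved"

    return tool


messages = []
result = call_with_escalation(make_flaky_tool(2), messages.append, lambda: "manual entry")
print(result, messages)
```

Keeping the ladder in code as well as in the prompt gives a deterministic upper bound on retries even if the model ignores the instruction.
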
  18. System Prompt: Key Techniques

    • Bullet Points > Prose: Model follows short bullets better than paragraphs
    • Labeled Sections: Model navigates by heading; each section = one concern
    • Lock the Language: "Respond ONLY in German" stated explicitly, prevents drift
    • Unclear Audio Handling (3-Step): 1. Ask to repeat → 2. Address audio quality → 3. Offer to skip
    • Rotating Confirmation Phrases: Provide a list of alternatives ("Understood", "Noted", "Alright") → prevents robotic repetition
    • Tool Preambles: Short sentence BEFORE the tool call so the user knows what's happening
    • Confirm Critical Data: For medical values always read back: "Tooth one-eight: three. Correct?"
    • Strict Domain Constraint: Exact scripted response for out-of-scope requests + escalation threshold
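
The first three techniques (bullets over prose, labeled sections, locked language) lend themselves to generating the system prompt from structured data instead of editing one long string. The builder below is a minimal sketch; the `# Heading` and `- bullet` output format and the function name are assumptions, and the example bullets are taken from the slides.

```python
def build_system_prompt(sections):
    """Render {heading: [bullet, ...]} as labeled sections with short bullets."""
    parts = []
    for heading, bullets in sections.items():
        parts.append(f"# {heading}")          # one heading per concern
        parts.extend(f"- {bullet}" for bullet in bullets)
        parts.append("")                      # blank line between sections
    return "\n".join(parts).strip()


prompt = build_system_prompt({
    "Role & Objective": ["You are a dentist assistant, your task is form-filling"],
    "Language": ["Respond ONLY in German"],
    "Rules": ["NEVER GUESS", "ALWAYS use get_selected_tooth first"],
})
print(prompt)
```

Because each section is a separate entry, a prompt change can be limited to one concern, which also makes before/after comparisons in evaluation easier.
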
  19. Security & Guardrails

    • No API key in the client! → Generate an ephemeral token via the server
    • Clear definitions of what is allowed and what is not: System Prompt
    • Escalation tool: If there is uncertainty → abort or hand off to a human agent
    • GDPR (DSGVO): Check in which country the model is running
  20. Evaluation

    How do I know a prompt change actually made things better?
  21. Evaluation: Why it is hard

    • Two axes, not one: A response can be right and still sound broken. Grade both separately.
    • Content quality: Did it do the right thing? Correctness, tool choice, tool arguments, instruction following.
    • Audio quality: Did it sound acceptable? Naturalness, pacing, pronunciation, behaviour under noise.
    • A turn is a pipeline: speech start/stop, commit, response, audio deltas. Log every stage to find the real root cause.
    • Transcript is not ground truth: The audio signal is the truth; a transcript is just a model's interpretation and can be wrong.
    • False fails & false passes: ASR drops a digit the model heard correctly, or a clean transcript hides clipped audio the model guessed at.
    • Grade on transcripts + traces: Run most automated grading on transcripts at scale; calibrate graders on noisy, production-like text.
    • Add an audio audit loop: Spot-check ~1–5% of sessions end-to-end by actually listening.
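
For the audio audit loop, the sessions to listen to should be picked reproducibly rather than by hand. The sketch below selects a fixed share of sessions by hashing the session ID, so the same session is always in or out of the sample. The function name, the ID format and the 2% default are assumptions; the slide only gives the ~1–5% range.

```python
import hashlib


def in_audit_sample(session_id, rate=0.02):
    """Return True if this session should be listened to end-to-end.

    Hashing the ID gives a stable, roughly uniform bucket per session,
    so the decision does not change between runs or machines.
    """
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big")  # 0 .. 2**32 - 1
    return bucket < rate * 2**32


session_ids = [f"session-{i}" for i in range(10_000)]
selected = [s for s in session_ids if in_audit_sample(s)]
print(f"{len(selected)} of {len(session_ids)} sessions selected for listening")
```

A deterministic sample also means reviewers can return to the same recordings after a prompt change and compare how they sound.
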
  22. Evaluation: Crawl / Walk / Run

    • Build complexity in steps: If your system cannot crawl, it will not run. Early evals must be diagnosable, repeatable and cheap.
    • Crawl: Synthetic (TTS) audio + single-turn. Tests intelligence: intent routing, tool choice, valid arguments.
    • Walk: Real noisy audio + single-turn. Tests perception: does it still hear "7pm", not "7" or "7:15"?
    • Run: Simulated user + multi-turn. Tests robustness: holds the goal, sequences tools, recovers from errors.
    • Example: "Change my reservation to 7pm". Crawl grades the next turn only; Walk replays it under phone-bandwidth noise; Run adds messy follow-ups + an injected tool error.
    • Single-turn vs multi-turn: Single-turn = can you win the battle. Multi-turn = can you win the war (episode-level outcome).
  23. Evaluation: The loop: does a change actually help?

    • Dataset: Start with a gold seed set (10–50) of must-not-fail flows. Balance positives and negatives; tag by intent, audio, language, expected tool.
    • Graders: Layer them: deterministic (tool calls, JSON, patterns), LLM rubric (correctness, helpfulness), audio (silence, overlap, interruptions).
    • Harness: One job: make runs comparable. Pin audio bytes, chunking and VAD; prefer VAD off + manual commit for reproducibility.
    • Manual review is highest-leverage: Automation shows what you can measure; listening shows what you should measure.
    • The iteration loop: Run evals, localise the failure to one behaviour, change ONE thing, re-run, confirm the fix improved it without regressions.
    • Regression suite: Hard cases you already fixed. Run on every prompt, model or tool change. Your "do not break" contract.
    • Rolling discovery set: Fresh failures from production. Promote real failure modes into the offline dataset over time.
    • Holdout set: Untouched subset run occasionally. If test scores climb while holdout stays flat, you are training for the test.
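
The deterministic grader layer and the regression check can be sketched in a few lines. The example assumes each eval case records the expected tool call and that a run stores one pass/fail result per case ID. The tool name `update_reservation`, its argument format and both function names are hypothetical.

```python
def grade_tool_call(expected, actual):
    """Deterministic grader: tool name and arguments must match exactly."""
    return (
        actual is not None
        and actual.get("name") == expected["name"]
        and actual.get("arguments") == expected["arguments"]
    )


def compare_runs(baseline, candidate):
    """Compare two runs given as {case_id: passed} dictionaries."""
    regressions = sorted(
        case for case, passed in baseline.items()
        if passed and not candidate.get(case, False)
    )
    fixes = sorted(
        case for case, passed in candidate.items()
        if passed and not baseline.get(case, False)
    )
    return {"regressions": regressions, "fixes": fixes}


expected = {"name": "update_reservation", "arguments": {"time": "19:00"}}
heard_7pm = {"name": "update_reservation", "arguments": {"time": "19:00"}}
heard_7_15 = {"name": "update_reservation", "arguments": {"time": "19:15"}}
print(grade_tool_call(expected, heard_7pm), grade_tool_call(expected, heard_7_15))
# prints "True False"

baseline = {"reserve-7pm": True, "cancel": True, "noisy-7pm": False}
candidate = {"reserve-7pm": True, "cancel": False, "noisy-7pm": True}
print(compare_runs(baseline, candidate))
# prints "{'regressions': ['cancel'], 'fixes': ['noisy-7pm']}"
```

A change is only an improvement if the list of regressions stays empty; a fix that breaks a previously passing case violates the "do not break" contract of the regression suite.
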
  24. Summary

    • Realtime Voice ≠ Chatbot with a microphone: It is a completely new interaction paradigm
    • Speech-to-Speech is the way to go for low-latency, hands-free scenarios
    • UX is the real challenge: clear and precise communication of state to the human user is key
    • Cost is controllable with caching, mini models, truncation and short sessions
    • Start small: one scenario, one tool, one demo
    • And MOST IMPORTANT: The implementation needs to provide a REAL benefit for the user