Sascha Lehmann
March 04, 2026

Chatless AI: Building Realtime Voice Interfaces for Web and Mobile

Realtime speech models such as GPT-realtime or Gemini-Live are giving rise to a new generation of interfaces: speech becomes an instantly responsive, low-latency interaction channel, with no prompting, no waiting, and hands-free operation.
In this talk, Sascha Lehmann shows how realtime models work technically, how to reliably control context limits, roles, and security, and how to integrate realtime AI into web and mobile applications, from architecture and cost optimization to a UX that guides users transparently through the dialog.

Transcript

  1. Chatless AI: Architecting Realtime, Hands-Free AI Interfaces for Web & Mobile
     Sascha Lehmann, @derLehmann_S, Consultant
  2. What "real-time" really means: speech interaction until now
     • Each arrow represents 200–500 ms of latency
     • Like a Zoom call with a huge delay
     Pipeline: Mic → STT (Speech to Text) → LLM → TTS (Text to Speech) → Speaker
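The per-arrow latency adds up quickly. The following is a minimal sketch of that arithmetic; it assumes the cascaded pipeline has four arrows (Mic, STT, LLM, TTS, Speaker), which is inferred from the diagram labels, and `cascade_latency_ms` is a hypothetical helper name.

```python
def cascade_latency_ms(hops: int, per_hop_ms: tuple = (200, 500)) -> tuple:
    """Return the (best, worst) end-to-end latency for a chain of hops."""
    low, high = per_hop_ms
    return hops * low, hops * high


# Assumption: Mic -> STT -> LLM -> TTS -> Speaker has four arrows.
best, worst = cascade_latency_ms(hops=4)
print(f"{best}-{worst} ms")  # 800-2000 ms
```

Under these assumptions a single turn costs roughly 0.8 to 2 seconds before the user hears anything, which matches the "Zoom call with a huge delay" impression.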
  3. What "real-time" really means: with a realtime model
     • Permanently available audio channel
     • It no longer feels like software, but like a real counterpart
     • Average response time: 320 ms
     Diagram labels: Mic, STT (Speech to Text), Realtime Model, TTS (Text to Speech), Speaker
  4. Available Realtime Models

     | Property | OpenAI gpt-realtime | OpenAI gpt-realtime-mini | Gemini Live 2.5 Flash |
     |---|---|---|---|
     | Architecture | Speech-to-Speech native | Speech-to-Speech native | Multimodal Live (Audio/Video/Text) |
     | Latency | ~320 ms (first response) | ~320 ms | Sub-800 ms |
     | Languages | Multilingual, language change mid-sentence | Multilingual | 24 languages |
     | Particular strengths | Best adherence to instructions, Tool Calling | Cost-efficient, quick | Video input, multimodal, cost-efficient |
     | Barge-In support | (not captured in transcript) | (not captured in transcript) | (not captured in transcript) |
     | Emotional adaptation | Configurable via prompt | Restricted | Affective Dialog native |
     | Voices | 10+ voices | Like gpt-realtime | 30+ HD voices |
     | Latest models | gpt-realtime-1.5 (Feb 2026) | gpt-realtime-mini-2025-12-15 | gemini-live-2.5-flash-native-audio (GA) |
  5. Architecture: High Level
     • The model processes audio directly (lowest latency)
     • Thinks and responds in speech
     • Doesn't rely on a transcript
     • Hears emotion and intent
     • Filters noise
     https://developers.openai.com/api/docs/guides/voice-agents/
  6. Architecture: Transport, WebRTC vs. WebSocket

     | Aspect | WebRTC | WebSocket |
     |---|---|---|
     | Protocol | UDP (packet loss tolerant) | TCP (guaranteed delivery) |
     | Latency | Very low (~50–150 ms) | Higher (TCP head-of-line blocking) |
     | Features | Echo cancellation, noise suppression, jitter buffer | - |
     | Ideal for | Browser apps, client-side | Server-side, phone integration |
     | Downsides | NAT traversal complexity | Latency peaks during packet loss |

     https://developers.openai.com/api/docs/guides/voice-agents/
  7. Realtime Connection
     Diagram: App, Backend, and RealtimeAPI, connected in numbered steps 1 to 4. Labels: System Prompt, RealtimeAPI-Key, Ephemeral Key, WebRTC.
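The diagram's key handling can be illustrated with a small in-memory simulation. This is only a sketch under stated assumptions: it does not call any provider API, the random token is a local stand-in for the key a provider would issue, and `Backend`, `EphemeralKey`, `create_session`, and `is_usable` are hypothetical names. The point it shows is that the long-lived key stays on the backend while the app only ever receives a short-lived credential.

```python
import secrets
import time
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EphemeralKey:
    value: str
    expires_at: float  # Unix timestamp after which the key is rejected


class Backend:
    """Holds the long-lived API key and system prompt; the app never sees them."""

    def __init__(self, api_key: str, system_prompt: str, ttl_seconds: float = 60.0):
        self._api_key = api_key
        self._system_prompt = system_prompt
        self._ttl = ttl_seconds

    def create_session(self, now: Optional[float] = None) -> EphemeralKey:
        # A real backend would call the provider here, authenticated with
        # self._api_key and configured with self._system_prompt, and would
        # forward the key it gets back. This sketch generates a random
        # stand-in token locally instead.
        now = time.time() if now is None else now
        return EphemeralKey(secrets.token_urlsafe(16), now + self._ttl)


def is_usable(key: EphemeralKey, now: Optional[float] = None) -> bool:
    """The app may open its WebRTC connection only while the key is valid."""
    now = time.time() if now is None else now
    return now < key.expires_at
```

A short time-to-live limits the damage if the credential leaks from the client; the 60-second default here is an arbitrary illustration, not a provider setting.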
  8. (1) What is my current state?
     • The user has no visual clue
     • Voice needs a state indicator
     • Other methods
       • Audio cues
       • Sound effects
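One way to keep a state indicator honest is to route every change through a small state machine that notifies the UI. The sketch below assumes a set of states (idle, listening, thinking, speaking, error) and a transition table that are not taken from the slides; `StateIndicator` and the callback are hypothetical.

```python
from enum import Enum


class VoiceState(Enum):
    IDLE = "idle"
    LISTENING = "listening"
    THINKING = "thinking"
    SPEAKING = "speaking"
    ERROR = "error"


# Assumed transition table; adapt it to the actual session lifecycle.
TRANSITIONS = {
    VoiceState.IDLE: {VoiceState.LISTENING},
    VoiceState.LISTENING: {VoiceState.THINKING, VoiceState.IDLE, VoiceState.ERROR},
    VoiceState.THINKING: {VoiceState.SPEAKING, VoiceState.ERROR},
    VoiceState.SPEAKING: {VoiceState.LISTENING, VoiceState.IDLE, VoiceState.ERROR},
    VoiceState.ERROR: {VoiceState.IDLE},
}


class StateIndicator:
    """Tracks the current state and calls on_change for every valid change."""

    def __init__(self, on_change):
        self.state = VoiceState.IDLE
        self._on_change = on_change

    def transition(self, new_state: VoiceState) -> None:
        if new_state not in TRANSITIONS[self.state]:
            raise ValueError(
                f"illegal transition {self.state.value} -> {new_state.value}"
            )
        self.state = new_state
        self._on_change(new_state)


# The callback is where a visual indicator, audio cue, or sound effect
# would be triggered; here it only prints.
indicator = StateIndicator(on_change=lambda s: print("state:", s.value))
indicator.transition(VoiceState.LISTENING)  # state: listening
```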
  9. (2) Success / Failure feedback
     • Unique audio cue sound
     • Vocal feedback from the agent
     • Visual indicator update
  10. (3) Error Handling
      • Automatic retry
      • Agent asks for clarification
      • Control to restart the voice agent session
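The three measures on this slide can be read as an escalation ladder. The sketch below maps a count of consecutive failures to one of them; the thresholds (first failure, second failure, third and later) are an assumption for illustration, and `next_action` is a hypothetical name.

```python
def next_action(consecutive_failures: int) -> str:
    """Pick an error-handling step based on how often the turn has failed."""
    if consecutive_failures <= 0:
        return "continue"
    if consecutive_failures == 1:
        return "retry"  # automatic retry, invisible to the user
    if consecutive_failures == 2:
        return "ask_for_clarification"  # the agent asks the user to rephrase
    return "offer_session_restart"  # show the control to restart the session


for failures in range(4):
    print(failures, next_action(failures))
```

Resetting the counter after every successful turn keeps a single hiccup from pushing the user straight to a restart.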
  11. (4) Interruption (Barge-In)
      • The user needs a way to cancel an action
      • Voice activity detection (VAD)
      • Also include manual cancellation
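To make the combination of VAD and manual cancellation concrete, here is a deliberately simple energy-based sketch. It is not how any realtime provider implements VAD; the RMS threshold, the frame count, and the function names are assumptions chosen for illustration.

```python
import math
from typing import Iterable, Sequence


def rms(frame: Sequence[int]) -> float:
    """Root mean square of one frame of PCM samples."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def should_interrupt(
    frames: Iterable[Sequence[int]],
    threshold: float = 500.0,
    min_frames: int = 3,
    manual_cancel: bool = False,
) -> bool:
    """Interrupt on a manual cancel or on min_frames consecutive loud frames."""
    if manual_cancel:
        return True
    run = 0
    for frame in frames:
        # A quiet frame resets the run, so short noise bursts are ignored.
        run = run + 1 if rms(frame) >= threshold else 0
        if run >= min_frames:
            return True
    return False
```

Requiring several consecutive loud frames trades a little reaction time for fewer false interruptions; the manual path stays available when detection fails, for example in a noisy room.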
  12. Realtime models do not only support speech
      • Voice is not the only interface; it is an additional one
      • Realtime models are also "multimodal models"
        • Speech
        • Text
        • Images
      • Gemini Live also supports live camera input, so the model can see what you currently see
  13. Pricing model of each model
      • ~32 tokens/s (input)
      • ~20–25 tokens/s (output)
      • Cost increases with conversation length
      • Each turn contains the existing context

      | Model | Input (Audio) | Cached Input | Output (Audio) | ~Cost per minute (conversation) |
      |---|---|---|---|---|
      | gpt-realtime | $32 / 1M tokens | $0.40 / 1M tokens | $64 / 1M tokens | ~$0.06–0.12 |
      | gpt-realtime-mini | $10 / 1M tokens | $0.30 / 1M tokens | $20 / 1M tokens | ~$0.02–0.04 |
      | Gemini Live 2.5 Flash | $3 / 1M tokens (audio) | Caching available | $12 / 1M tokens | ~$0.01–0.02 |
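The per-minute figures can be roughly reproduced from the token rates and list prices on this slide. The sketch below is a simplification: it prices only fresh audio and ignores the re-sent context and cached-input discounts, so real sessions will differ. `audio_cost_usd` and the dictionary layout are hypothetical.

```python
# (input, output) price in USD per 1M audio tokens, taken from the slide.
PRICES_PER_MILLION = {
    "gpt-realtime": (32.0, 64.0),
    "gpt-realtime-mini": (10.0, 20.0),
}
INPUT_TOKENS_PER_SEC = 32  # ~32 tokens/s of input audio (slide)


def audio_cost_usd(
    model: str,
    input_seconds: float,
    output_seconds: float,
    output_tokens_per_sec: float = 25,  # slide gives ~20-25 tokens/s
) -> float:
    """Cost of fresh audio only; re-sent context and caching are ignored."""
    in_price, out_price = PRICES_PER_MILLION[model]
    input_tokens = input_seconds * INPUT_TOKENS_PER_SEC
    output_tokens = output_seconds * output_tokens_per_sec
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


# One minute in which the user and the model each speak for 30 seconds.
print(round(audio_cost_usd("gpt-realtime", 30, 30), 5))  # 0.07872
```

The result falls inside the ~$0.06–0.12 range listed for gpt-realtime; because each turn carries the existing context, the effective cost per minute grows as the conversation gets longer.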
  14. How to keep the costs under control
      1. Make use of prompt caching
      2. Keep the agent from being too talkative
      3. Configure truncation and retention ratio
      4. Manage the conversation history manually
      5. Make use of the mini model
      6. Proactive tool calling without confirmation
      7. Design clear conversation flows with a definite end
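Points 3 and 4 can be sketched as a small history-trimming routine. This is an illustration of the general idea, not a provider's truncation algorithm: once the history exceeds `max_tokens`, the oldest turns are dropped until the total is at most `retention_ratio` times that limit. The function name, the tuple layout, and the default ratio are assumptions.

```python
def truncate_history(turns, max_tokens, retention_ratio=0.8):
    """Drop the oldest turns once the history exceeds max_tokens.

    turns is a list of (text, token_count) tuples, oldest first.
    """
    total = sum(tokens for _, tokens in turns)
    if total <= max_tokens:
        return list(turns)
    target = max_tokens * retention_ratio
    kept = list(turns)
    while kept and total > target:
        _, tokens = kept.pop(0)  # remove the oldest turn first
        total -= tokens
    return kept


history = [("greeting", 400), ("tooth 18", 400), ("tooth 17", 400)]
print(truncate_history(history, max_tokens=1000))
# [('tooth 18', 400), ('tooth 17', 400)]
```

Cutting below the limit rather than exactly to it means truncation happens less often, which leaves the retained prefix stable for longer and is friendlier to prompt caching.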
  15. Example calculation
      • Average duration of a pocket depth measurement: ~3 min
      • With gpt-realtime-mini + caching: ~$0.04–0.08 per finding
      • With Gemini Live Flash: ~$0.03–0.06 per finding
  16. System Prompt: Best Practices
      • Role & Objective: clear identity + task definition ("You are a dentist assistant, your task is form-filling")
      • Personality & Tone: language, formality, response length, and pacing explicitly set (e.g. "max 2 sentences", formal address, "professional & concise")
      • Reference Pronunciations: phonetic guides for domain terms + tooth numbers that TTS would mispronounce ("one-eight" not "eighteen")
      • Tools / Function Calling: per tool, define the trigger condition, a preamble ("One moment..."), and error handling (1x silent retry, 2x notify user, 3x fall back to manual)
      • Instructions / Rules: bullet points over prose, CAPS for critical rules ("NEVER GUESS", "ALWAYS use get_selected_tooth first")
      • Conversation Flow: state machine for procedures (Pocket Depth: Init → Measurement Loop → Completion)
      • Safety & Escalation: domain constraint, exact refusal scripts, escalation after 3x off-topic
      • Examples: concrete input/output pairs for ambiguous cases
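The conversation-flow bullet names a three-phase procedure (Init → Measurement Loop → Completion). A minimal sketch of that state machine follows; the number of measurements per procedure is not given in the slides, so `expected_measurements` is a hypothetical parameter, as are the class and method names.

```python
from enum import Enum, auto


class Phase(Enum):
    INIT = auto()
    MEASUREMENT_LOOP = auto()
    COMPLETION = auto()


class PocketDepthFlow:
    """Init -> Measurement Loop -> Completion, as named on the slide."""

    def __init__(self, expected_measurements: int):
        if expected_measurements < 1:
            raise ValueError("expected_measurements must be at least 1")
        self.expected = expected_measurements
        self.values = []
        self.phase = Phase.INIT

    def start(self) -> None:
        if self.phase is not Phase.INIT:
            raise RuntimeError("flow already started")
        self.phase = Phase.MEASUREMENT_LOOP

    def record(self, depth: int) -> None:
        if self.phase is not Phase.MEASUREMENT_LOOP:
            raise RuntimeError("not in the measurement loop")
        self.values.append(depth)
        # Leave the loop once every expected value has been captured.
        if len(self.values) == self.expected:
            self.phase = Phase.COMPLETION
```

Mirroring the prompt's flow in application code gives the tools a place to reject out-of-order calls, for example a measurement reported before the procedure has started.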
  17. System Prompt: Key Techniques
      • Bullet Points > Prose: the model follows short bullets better than paragraphs
      • Labeled Sections: the model navigates by heading; each section = one concern
      • Lock the Language: "Respond ONLY in German" stated explicitly, prevents drift
      • Unclear Audio Handling (3-Step): 1. Ask to repeat → 2. Address audio quality → 3. Offer to skip
      • Rotating Confirmation Phrases: provide a list of alternatives ("Understood", "Noted", "Alright") → prevents robotic repetition
      • Tool Preambles: short sentence BEFORE the tool call so the user knows what's happening
      • Confirm Critical Data: for medical values always read back: "Tooth one-eight: three. Correct?"
      • Strict Domain Constraint: exact scripted response for out-of-scope requests + escalation threshold
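Two of these techniques lend themselves to a small helper, for instance when generating the examples section of a system prompt or the text of a tool response. The sketch below spells numbers digit by digit for the read-back and cycles through the confirmation phrases from the slide. Round-robin rotation is just one possible strategy, and all names are hypothetical.

```python
import itertools

DIGIT_WORDS = ("zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine")


def spoken_digits(number: int) -> str:
    """Spell a number digit by digit, e.g. 18 -> 'one-eight'."""
    if number < 0:
        raise ValueError("number must be non-negative")
    return "-".join(DIGIT_WORDS[int(d)] for d in str(number))


def read_back(tooth: int, value: int) -> str:
    """Build the confirmation question for a critical value."""
    return f"Tooth {spoken_digits(tooth)}: {spoken_digits(value)}. Correct?"


class Confirmations:
    """Cycles through alternatives so the agent does not repeat itself."""

    def __init__(self, phrases=("Understood", "Noted", "Alright")):
        self._cycle = itertools.cycle(phrases)

    def next(self) -> str:
        return next(self._cycle)


print(read_back(18, 3))  # Tooth one-eight: three. Correct?
```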
  18. Security & Guardrails
      • No API key in the client! → Generate an ephemeral token via the server
      • Clear definitions of what is allowed and what is not, stated in the system prompt
      • Escalation tool: if there is uncertainty → abort or hand off to a human agent
      • GDPR (DSGVO): check in which country the model is running
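The escalation guardrail can also be enforced outside the prompt. The sketch below counts out-of-scope requests and switches from a scripted refusal to a human handoff at a threshold; the threshold of three follows the "escalation after 3x off-topic" rule mentioned earlier in the deck, while the class name, the refusal text, and the choice not to reset the counter are assumptions.

```python
class DomainGuard:
    """Scripted refusal for off-topic requests, handoff after a threshold."""

    def __init__(self, refusal_script: str, max_off_topic: int = 3):
        self.refusal_script = refusal_script
        self.max_off_topic = max_off_topic
        self.off_topic_count = 0  # cumulative over the whole session

    def handle(self, in_scope: bool) -> str:
        if in_scope:
            return "continue"
        self.off_topic_count += 1
        if self.off_topic_count >= self.max_off_topic:
            return "handoff_to_human"
        return self.refusal_script


# Hypothetical refusal text; a real one belongs in the system prompt verbatim.
guard = DomainGuard("I can only help with recording dental findings.")
print(guard.handle(in_scope=False))
```

How the application decides whether a request is in scope is left open here; in practice that signal would likely come from the model through a dedicated tool call.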
  19. Summary
      • Realtime voice ≠ chatbot with a microphone: it is a completely new interaction paradigm
      • Speech-to-speech is the way to go for low-latency, hands-free scenarios
      • UX is the real challenge: clear and precise communication of state to the human user is key
      • Cost is controllable with caching, mini models, truncation, and short sessions
      • Start small: one scenario, one tool, one demo
      • And MOST IMPORTANT: the implementation needs to provide a REAL benefit for the user