“Hello, AI?!” — Real-time Interactions with Language Models

Large Language Models (LLMs) have changed how we design software: instead of clicks and GUIs, natural language now dominates, not just via keyboard and text, but also by voice. In this session, you will learn how to integrate voice-enabled AI models directly into your application and control them in real time with your voice. Thanks to realtime APIs, WebRTC, and tool calling, new possibilities for natural language interfaces are emerging: bidirectional, multilingual, and with minimal latency.

In practical demos, Christian Liebel from Thinktecture will show you how to talk to selected LLMs by voice, connect your own functionality to them, and develop smart, conversational interfaces. This is not science fiction.

Caution: Interactive voice AIs can be addictive.

Christian Liebel

October 28, 2025

Transcript

  1. Hello, it’s me. “Hello, AI!?” Real-time interactions with language models
     Christian Liebel – X: @christianliebel – Bluesky: @christianliebel.com – Email: christian.liebel@thinktecture.com – Angular, PWA & Generative AI – Slides: thinktecture.com/christian-liebel
  2. Overview – Generative AI
     – Text: OpenAI GPT, Mistral, …
     – Audio/Music: Musico, Soundraw, …
     – Images: DALL·E, Firefly, …
     – Video: Sora, Runway, …
     – Speech: Whisper, tortoise-tts, …
  3. Overview – Generative AI
     – Text: OpenAI GPT, Mistral, …
     – Audio/Music: Musico, Soundraw, …
     – Images: DALL·E, Firefly, …
     – Video: Sora, Runway, …
     – Speech: Whisper, tortoise-tts, …
  4. Realtime Models
     – Process speech input and output natively (transcription optional)
     – Multiple languages and output voices are supported
     – Tool/function calling is supported
     – Voice Activity Detection (VAD) is activated automatically (the model waits for a period of silence before responding)
     – The model can be interrupted (see the sketch below)
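     Interruptions can also be triggered from code. A minimal sketch, assuming the WebRTC data channel (dc) that is set up in the code examples later in this deck; verify the event names against the current API reference.

       // Barge-in from code (sketch): response.cancel stops the in-progress
       // generation, output_audio_buffer.clear drops audio already queued
       // for WebRTC playback.
       function interruptModel(dc) {
         dc.send(JSON.stringify({ type: "response.cancel" }));
         dc.send(JSON.stringify({ type: "output_audio_buffer.clear" }));
       }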
  5. Realtime Models – Use Cases
     – Natural language interfaces
     – Smart form filling
     – Navigation
     – Voice assistants
     – Phone agents
     – Alternative input methods for accessibility (e.g., ticket machines)
  6. Realtime Models
     OpenAI Realtime API
     – gpt-realtime
     – gpt-realtime-mini
     Gemini Live API
     – Half-cascade (better for tools): Gemini Live 2.5 Flash, Gemini 2.0 Flash Live 001
     – Native audio dialog (reasoning): Gemini 2.5 Flash
  7. APIs
     OpenAI Realtime API
     – 57+ languages
     – Supports speech, text and image input
     – Supports speech and text output
     – Supports WebRTC and WebSockets
     – Agents SDK in TS, WebRTC integration is ~50 LOC
     Gemini Live API
     – 40+ languages
     – Supports speech, text and video input
     – Supports speech and text output
     – Supports WebSockets
     – No JS SDK yet, integration is ~1300 LOC
  8. WebRTC – Web Real-Time Communication
     – JavaScript API for real-time audio/video communication
     – Supports data channels for data transfer
     – Used by Google Meet, Microsoft Teams (web), …
     – W3C Recommendation (web standard)
     – Supported by all major browsers for several years (Chrome 27, Edge 15, Safari 11, Firefox 22)
     https://webrtc.org/
  9. Media Capture & Streams API – getUserMedia()
     – JavaScript APIs for accessing media devices
     – Captures video and/or audio input
     – W3C Candidate Recommendation
     – Supported by all major browsers for several years (Chrome 21, Edge 12, Safari 11, Firefox 17)
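     A minimal sketch of capturing microphone input with getUserMedia(); the audio constraints shown (echo cancellation, noise suppression) are illustrative assumptions, not taken from the slides.

       // Ask the user for microphone access; the browser shows a permission prompt.
       const stream = await navigator.mediaDevices.getUserMedia({
         audio: { echoCancellation: true, noiseSuppression: true },
       });

       // Inspect the captured audio track, e.g. to display the device name.
       const [track] = stream.getAudioTracks();
       console.log(`Capturing from: ${track.label}`);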
  10. OpenAI Realtime API – Code Example (1/3)
      // Create a peer connection
      const pc = new RTCPeerConnection();

      // Set up to play remote audio from the model
      const audioEl = document.createElement("audio");
      audioEl.autoplay = true;
      pc.ontrack = e => audioEl.srcObject = e.streams[0];

      // Add local audio track for microphone input in the browser
      const ms = await navigator.mediaDevices.getUserMedia({ audio: true });
      pc.addTrack(ms.getTracks()[0]);
  11. OpenAI Realtime API – Code Example (2/3)
      // Set up data channel for sending and receiving events
      const dc = pc.createDataChannel("oai-events");
      dc.addEventListener("message", (e) => {
        // Realtime server events appear here!
        console.log(e);
      });
  12. OpenAI Realtime API – Code Example (3/3)
      // Create an SDP offer and set it as the local description
      const offer = await pc.createOffer();
      await pc.setLocalDescription(offer);

      // Send the offer to the Realtime API and apply its answer
      const baseUrl = "https://api.openai.com/v1/realtime/calls";
      const model = "gpt-realtime";
      const sdpResponse = await fetch(`${baseUrl}?model=${model}`, {
        method: "POST",
        body: offer.sdp,
        headers: {
          Authorization: `Bearer ${EPHEMERAL_KEY}`,
          "Content-Type": "application/sdp",
        },
      });
      const answer = { type: "answer", sdp: await sdpResponse.text() };
      await pc.setRemoteDescription(answer);
  13. OpenAI Realtime API – Session
      https://platform.openai.com/docs/guides/realtime-conversations#realtime-speech-to-speech-sessions
  14. OpenAI Realtime API – Session events
      Server → Client: session.created – Session initialized with default values.
      Client → Server: session.update – Update session voice, modalities, tools, turn detection.
      Server → Client: session.updated – Session updated.
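      A minimal sketch of sending session.update over the data channel from the code examples above; the session fields shown (voice name, server VAD settings) are assumptions, and the exact session shape differs between API versions, so check the reference linked on the next slide.

        // Reconfigure the session once the data channel is open.
        dc.addEventListener("open", () => {
          dc.send(JSON.stringify({
            type: "session.update",
            session: {
              instructions: "You are a friendly voice assistant. Answer briefly.", // assumption
              voice: "marin",                                                      // assumption
              turn_detection: { type: "server_vad", silence_duration_ms: 500 },    // assumption
            },
          }));
        });
        // The server confirms with a session.updated event on the same channel.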
  15. OpenAI Realtime API – Session events
      https://platform.openai.com/docs/api-reference/realtime-client-events/session/update (28.10.2025)
  16. OpenAI Realtime API – Input audio buffer events (selection)
      Server → Client: input_audio_buffer.speech_started – Server has detected speech.
      Server → Client: input_audio_buffer.speech_stopped – Server has detected end of speech.
      Server → Client: input_audio_buffer.committed – Server has committed the input buffer and will create a conversation item.
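      A minimal sketch of reacting to these server events inside the data channel’s message listener from code example 2/3; what to do in each branch (e.g., updating the UI) is an assumption.

        dc.addEventListener("message", (e) => {
          const event = JSON.parse(e.data);
          switch (event.type) {
            case "input_audio_buffer.speech_started":
              // The user started talking – e.g., show a “listening” indicator.
              break;
            case "input_audio_buffer.speech_stopped":
              // The user stopped talking; the server will commit the buffer next.
              break;
            case "input_audio_buffer.committed":
              // Input committed; a conversation item and a response will follow.
              break;
          }
        });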
  17. OpenAI Realtime API – Conversation events
      Client → Server: conversation.item.create – Create a conversation item programmatically (e.g., from text input).
      Server → Client: conversation.item.created – Input audio buffer has been committed, client has sent a conversation item, or the server is generating a response.
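      A minimal sketch of creating a conversation item from text input and then requesting a response; the example text is made up, and the item shape should be checked against the API reference.

        // Send a typed user message as a conversation item.
        dc.send(JSON.stringify({
          type: "conversation.item.create",
          item: {
            type: "message",
            role: "user",
            content: [{ type: "input_text", text: "What can you do?" }], // assumption
          },
        }));

        // Ask the model to respond to the conversation so far.
        dc.send(JSON.stringify({ type: "response.create" }));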
  18. Implications on Architecture
      No longer request/response, but publish/subscribe: the client and the server both emit events over the open connection.
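      One way to reflect this on the client is a small event router instead of awaiting individual responses. A sketch; the handler map and the chosen event types are assumptions for illustration.

        // Route incoming server events to handlers instead of awaiting a reply.
        const handlers = {
          "session.updated": (ev) => console.log("session updated", ev.session),
          "response.done":   (ev) => console.log("response finished", ev.response),
        };

        dc.addEventListener("message", (e) => {
          const event = JSON.parse(e.data);
          handlers[event.type]?.(event);
        });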
  19. Tool/Function Calling – Let’s change the world
      Tools/function calling can be used to…
      – extend the model’s knowledge by accessing custom data (customer data, articles, orders, wikis, postcode API, …)
      – extend the model’s capabilities by executing real-world actions (navigate, send an SMS, update order status in a database, fill in a form, perform a web search, …)
  20. OpenAI Realtime API – Tool/function calling
      – The OpenAI Realtime API supports adding tools at response level (response.create) or session level (session.update) – see the sketch below
      – When processing input, the model determines whether it should call one of the provided functions
      – The function must be executed by the client
      – Once the function has been executed, the client can create a new conversation item with the result of the function call (“return value”)
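      A minimal sketch of registering a tool at session level; the tool name, description, and parameter schema are made-up examples (based on the order-status use case mentioned earlier), not taken from the deck.

        // Register a hypothetical tool at session level.
        dc.send(JSON.stringify({
          type: "session.update",
          session: {
            tools: [{
              type: "function",
              name: "update_order_status", // assumption
              description: "Updates the status of an order in the database.",
              parameters: {
                type: "object",
                properties: {
                  orderId: { type: "string" },
                  status: { type: "string", enum: ["open", "shipped", "cancelled"] },
                },
                required: ["orderId", "status"],
              },
            }],
          },
        }));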
  21. OpenAI Realtime API – Tool/function calling events (selection)
      Client → Server: session.update – Set available functions.
      Server → Client: response.done – Contains the function call.
      Client → Server: conversation.item.create – Provide the result of a function call.
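      A minimal sketch of detecting a function call in response.done and returning its result; the response payload layout and the helper updateOrderStatus are assumptions for illustration.

        dc.addEventListener("message", async (e) => {
          const event = JSON.parse(e.data);
          if (event.type !== "response.done") return;

          // Look for function calls among the response output items.
          for (const item of event.response?.output ?? []) {
            if (item.type !== "function_call") continue;

            // Execute the function on the client (updateOrderStatus is hypothetical).
            const args = JSON.parse(item.arguments);
            const result = await updateOrderStatus(args.orderId, args.status);

            // Return the “return value” as a new conversation item …
            dc.send(JSON.stringify({
              type: "conversation.item.create",
              item: {
                type: "function_call_output",
                call_id: item.call_id,
                output: JSON.stringify(result),
              },
            }));

            // … and ask the model to continue with this result.
            dc.send(JSON.stringify({ type: "response.create" }));
          }
        });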
  22. OpenAI Realtime API
      https://community.openai.com/t/estimate-the-cost-for-1-min-usage-of-real-time-api/1019290/6
  23. Local Realtime APIs
      – Whisper (STT)
      – Silero (VAD)
      – SmolLM2-1.7B (LLM)
      – Kokoro (TTS)
  24. Summary
      – Realtime models unlock new, exciting opportunities for natural language interfaces beyond chat boxes
      – Bidirectional, multilingual, minimum latency
      – Quality is good, but not perfect
      – Pricing seems quite high
      – Fun!
      – No science fiction, try it today!