Slide 1

Slide 1 text

“Hello, AI?!” Real-time interactions with language models Christian Liebel @christianliebel Consultant

Slide 2

Slide 2 text

Hello, it’s me.
Christian Liebel
X: @christianliebel
Bluesky: @christianliebel.com
Email: christian.liebel@thinktecture.com
Angular, PWA & Generative AI
Slides: thinktecture.com/christian-liebel

Slide 3

Slide 3 text

Overview: Generative AI
– Text: OpenAI GPT, Mistral, …
– Audio/Music: Musico, Soundraw, …
– Images: DALL·E, Firefly, …
– Video: Sora, Runway, …
– Speech: Whisper, tortoise-tts, …

Slide 5

Slide 5 text

Large Language Models

Slide 6

Slide 6 text

DEMO

Slide 7

Slide 7 text

Multimodal Models DEMO

Slide 8

Slide 8 text

DEMO

Slide 9

Slide 9 text

Multimodal Realtime Models DEMO

Slide 10

Slide 10 text

DEMO

Slide 11

Slide 11 text

Realtime Models
– Process speech input and output natively (transcription optional)
– Multiple languages and output voices are supported
– Tool/function calling is supported
– Voice Activity Detection (VAD) is activated automatically (the model waits for a period of silence before responding)
– The model can be interrupted
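The "waits for a period of silence" behaviour can be pictured with a toy detector. This is only a sketch of the general idea, not the VAD the realtime models actually use: the function names, the RMS threshold (0.02) and the number of silent frames (3) are assumptions made up for illustration.

```javascript
// Toy energy-based end-of-speech detector (illustration only).
// Each frame is an array of PCM samples in the range [-1, 1].

// Root mean square of one frame, used as a rough loudness measure.
function rms(frame) {
  if (frame.length === 0) return 0;
  let sum = 0;
  for (const sample of frame) sum += sample * sample;
  return Math.sqrt(sum / frame.length);
}

// Returns the index of the frame at which the turn is considered over,
// i.e. after `silenceFrames` consecutive quiet frames following speech.
// Returns -1 if no end of speech was found.
function findEndOfSpeech(frames, { threshold = 0.02, silenceFrames = 3 } = {}) {
  let speaking = false;
  let silentRun = 0;
  for (let i = 0; i < frames.length; i++) {
    if (rms(frames[i]) >= threshold) {
      speaking = true; // speech started or continues
      silentRun = 0;
    } else if (speaking) {
      silentRun++;
      if (silentRun >= silenceFrames) return i;
    }
  }
  return -1;
}
```

Leading silence is ignored on purpose: the silence counter only starts once speech has been heard, which mirrors the idea that the model responds after the speaker has stopped.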

Slide 12

Slide 12 text

Realtime Models: Use Cases
– Natural language interfaces
– Smart form filling
– Navigation
– Voice assistants
– Phone agents
– Alternative input methods for accessibility (e.g., ticket machines)

Slide 13

Slide 13 text

Realtime Models
OpenAI Realtime API
– gpt-realtime
– gpt-realtime-mini
Gemini Live API
Half-cascade (better for tools)
– Gemini Live 2.5 Flash
– Gemini 2.0 Flash Live 001
Native audio dialog (reasoning)
– Gemini 2.5 Flash

Slide 14

Slide 14 text

APIs
OpenAI Realtime API
– 57+ languages
– Supports speech, text and image input
– Supports speech and text output
– Supports WebRTC and WebSockets
– Agents SDK in TS, WebRTC integration is ~50 LOC
Gemini Live API
– 40+ languages
– Supports speech, text and video input
– Supports speech and text output
– Supports WebSockets
– No JS SDK yet, integration is ~1300 LOC

Slide 15

Slide 15 text

WebRTC: Web Real-Time Communication
– JavaScript API for real-time audio/video communication
– Supports data channels for data transfer
– Used by Google Meet, Microsoft Teams (web), …
– W3C Recommendation (web standard)
– Supported by all major browsers for several years (Chrome 27, Edge 15, Safari 11, Firefox 22)
https://webrtc.org/

Slide 16

Slide 16 text

Media Capture & Streams API: getUserMedia()
– JavaScript APIs for accessing media devices
– Captures video and/or audio input
– W3C Candidate Recommendation
– Supported by all major browsers for several years (Chrome 21, Edge 12, Safari 11, Firefox 17)

Slide 17

Slide 17 text

OpenAI Realtime API: Code Example (1/3)

// Create a peer connection
const pc = new RTCPeerConnection();

// Set up to play remote audio from the model
const audioEl = document.createElement("audio");
audioEl.autoplay = true;
pc.ontrack = e => audioEl.srcObject = e.streams[0];

// Add local audio track for microphone input in the browser
const ms = await navigator.mediaDevices.getUserMedia({ audio: true });
pc.addTrack(ms.getTracks()[0]);

Slide 18

Slide 18 text

OpenAI Realtime API: Code Example (2/3)

// Set up data channel for sending and receiving events
const dc = pc.createDataChannel("oai-events");
dc.addEventListener("message", (e) => {
  // Realtime server events appear here!
  console.log(e);
});

Slide 19

Slide 19 text

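OpenAI Realtime API: Code Example (3/3)

// Create the SDP offer and set it as the local description
// (the slide uses offer.sdp below, so the offer has to exist first)
const offer = await pc.createOffer();
await pc.setLocalDescription(offer);

const baseUrl = "https://api.openai.com/v1/realtime/calls";
const model = "gpt-realtime";

// Send the offer to the Realtime API, authenticated with an ephemeral key
const sdpResponse = await fetch(`${baseUrl}?model=${model}`, {
  method: "POST",
  body: offer.sdp,
  headers: {
    Authorization: `Bearer ${EPHEMERAL_KEY}`,
    "Content-Type": "application/sdp",
  },
});

// The response body is the SDP answer that completes the connection
const answer = { type: "answer", sdp: await sdpResponse.text() };
await pc.setRemoteDescription(answer);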

Slide 20

Slide 20 text

OpenAI Realtime Console DEMO

Slide 21

Slide 21 text

OpenAI Realtime API: Session
https://platform.openai.com/docs/guides/realtime-conversations#realtime-speech-to-speech-sessions

Slide 22

Slide 22 text

OpenAI Realtime API: Session events
– session.created (server → client): Session initialized with default values.
– session.update (client → server): Update session voice, modalities, tools, turn detection.
– session.updated (server → client): Session updated.
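A session.update event is a JSON message sent over the data channel. The following is a minimal sketch: buildSessionUpdate and sendEvent are hypothetical helper names, fakeChannel stands in for the RTCDataChannel from the code example so the snippet runs outside a browser, and "instructions" is just one example setting. The exact set of session fields (and whether additional fields such as a session type are required) depends on the API version, so check the API reference before relying on it.

```javascript
// Wrap a partial session configuration in a session.update client event.
function buildSessionUpdate(sessionPatch) {
  return { type: "session.update", session: { ...sessionPatch } };
}

// Events are serialized as JSON strings before being sent.
function sendEvent(channel, event) {
  channel.send(JSON.stringify(event));
}

// Stand-in for the data channel ("dc" in the browser example); it only records messages.
const sent = [];
const fakeChannel = { send: (message) => sent.push(message) };

sendEvent(
  fakeChannel,
  buildSessionUpdate({ instructions: "Answer briefly and in English." })
);
```

In the browser, the same call would take the real data channel instead of fakeChannel, and the server would be expected to answer with session.updated.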

Slide 23

Slide 23 text

OpenAI Realtime API: Session events
https://platform.openai.com/docs/api-reference/realtime-client-events/session/update (28.10.2025)

Slide 24

Slide 24 text

OpenAI Realtime API: Input audio buffer events (selection)
– input_audio_buffer.speech_started (server → client): Server has detected speech.
– input_audio_buffer.speech_stopped (server → client): Server has detected end of speech.
– input_audio_buffer.committed (server → client): Server has committed input buffer and will create conversation item.

Slide 25

Slide 25 text

OpenAI Realtime API: Conversation events
– conversation.item.create (client → server): Create a conversation item programmatically (e.g., from text input).
– conversation.item.created (server → client): Input audio buffer has been committed, client has sent a conversation item, or the server is generating a response.
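Creating a conversation item from text input could look roughly like this. The helper names are hypothetical; the event and item shapes follow the Realtime API documentation as I understand it and should be verified against the current reference. Sending a response.create event afterwards asks the model to respond to the new item.

```javascript
// Build a conversation.item.create event for a user text message.
function buildUserTextItem(text) {
  return {
    type: "conversation.item.create",
    item: {
      type: "message",
      role: "user",
      content: [{ type: "input_text", text }],
    },
  };
}

// Build the event that asks the model to generate a response.
function buildResponseCreate() {
  return { type: "response.create" };
}

// Both events would be sent as JSON strings over the data channel, in this order.
const textTurn = [buildUserTextItem("What is WebRTC?"), buildResponseCreate()];
const textTurnMessages = textTurn.map((event) => JSON.stringify(event));
```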

Slide 26

Slide 26 text

Implications on Architecture
No longer request/response, but publish/subscribe
(Two diagrams, each showing Client and Server)
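In practice, publish/subscribe means the client registers handlers per event type instead of awaiting one reply per request. A minimal sketch of such a dispatcher follows; RealtimeEventBus is a hypothetical helper, not part of any SDK, and the raw messages are assumed to be JSON strings with a "type" field, as delivered in the data channel's message events.

```javascript
// Minimal event bus that routes server events by their "type" field.
class RealtimeEventBus {
  constructor() {
    this.handlers = new Map();
  }

  // Subscribe to one event type; returns a function that unsubscribes again.
  on(type, handler) {
    const list = this.handlers.get(type) ?? [];
    this.handlers.set(type, [...list, handler]);
    return () => {
      const current = this.handlers.get(type) ?? [];
      this.handlers.set(type, current.filter((h) => h !== handler));
    };
  }

  // Parse a raw message and notify all handlers registered for its type.
  dispatch(rawMessage) {
    const event = JSON.parse(rawMessage);
    for (const handler of this.handlers.get(event.type) ?? []) handler(event);
    return event.type;
  }
}

// Example: collect which events arrived, in order.
const bus = new RealtimeEventBus();
const seen = [];
bus.on("session.created", () => seen.push("created"));
const stopListening = bus.on("input_audio_buffer.speech_started", () => seen.push("speech"));

bus.dispatch(JSON.stringify({ type: "session.created" }));
bus.dispatch(JSON.stringify({ type: "input_audio_buffer.speech_started" }));
stopListening();
bus.dispatch(JSON.stringify({ type: "input_audio_buffer.speech_started" }));
```

In the browser, the bus would be wired to the data channel with something like dc.addEventListener("message", (e) => bus.dispatch(e.data)).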

Slide 27

Slide 27 text

Tool/Function Calling: Let’s change the world
Tools/function calling can be used to...
– extend the model’s knowledge by accessing custom data (customer data, articles, orders, wikis, postcode API, …)
– extend the model’s capabilities by executing real-world actions (navigate, send an SMS, update order status in a database, fill in a form, perform a web search, …)

Slide 28

Slide 28 text

Tool/Function Calling: Foundation of Agentic AI
https://mcp.so

Slide 29

Slide 29 text

OpenAI Realtime API: Tool/function calling
– OpenAI Realtime API supports adding tools at response level (response.create) or session level (session.update)
– When processing input, the model determines if it should call one of the present functions
– The function must be executed by the client
– Once the function has been executed, the client can create a new conversation item with the result of the function call (“return value”)
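Registering a tool at session level might look like the following sketch. The tool itself (get_order_status and its orderId parameter) is invented for illustration; the overall shape (a function tool described by a JSON Schema, passed in a tools array) follows the Realtime API documentation as I understand it and should be checked against the current reference.

```javascript
// Hypothetical tool: lets the model look up an order status.
const getOrderStatusTool = {
  type: "function",
  name: "get_order_status",
  description: "Look up the current status of an order by its ID.",
  // The parameters are described as a JSON Schema object.
  parameters: {
    type: "object",
    properties: {
      orderId: { type: "string", description: "The ID of the order." },
    },
    required: ["orderId"],
  },
};

// Session-level registration: the tool is available for all following responses.
const toolSessionUpdate = {
  type: "session.update",
  session: {
    tools: [getOrderStatusTool],
    tool_choice: "auto", // let the model decide whether to call a function
  },
};

// This string is what would be sent over the data channel.
const toolSessionUpdateMessage = JSON.stringify(toolSessionUpdate);
```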

Slide 30

Slide 30 text

OpenAI Realtime API: Tool/function calling events (selection)
– session.update (client → server): Set available functions.
– response.done (server → client): Contains the function call.
– conversation.item.create (client → server): Provide the result of a function call.
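The round trip above can be sketched as a pure function that takes a response.done event and returns the events the client should send back. The local function registry and the sample event are made up for illustration, and the field names (output, call_id, arguments, function_call_output) follow the API documentation as I understand it; verify them against the current reference.

```javascript
// Hypothetical client-side implementations of the functions offered to the model.
const localFunctions = {
  get_order_status: ({ orderId }) => ({ orderId, status: "shipped" }),
};

// Execute all function calls in a response.done event and build the reply events.
function handleResponseDone(event, functions) {
  const outgoing = [];
  for (const item of event.response?.output ?? []) {
    if (item.type !== "function_call") continue;
    // The arguments arrive as a JSON string.
    const args = JSON.parse(item.arguments || "{}");
    const result = Object.hasOwn(functions, item.name)
      ? functions[item.name](args)
      : { error: `Unknown function: ${item.name}` };
    // Provide the "return value" as a new conversation item.
    outgoing.push({
      type: "conversation.item.create",
      item: {
        type: "function_call_output",
        call_id: item.call_id,
        output: JSON.stringify(result),
      },
    });
  }
  // Ask the model to continue once all results have been provided.
  if (outgoing.length > 0) outgoing.push({ type: "response.create" });
  return outgoing;
}

// Sample event, shaped like a response that contains one function call.
const sampleResponseDone = {
  type: "response.done",
  response: {
    output: [
      {
        type: "function_call",
        name: "get_order_status",
        call_id: "call_123",
        arguments: JSON.stringify({ orderId: "A-42" }),
      },
    ],
  },
};
const replies = handleResponseDone(sampleResponseDone, localFunctions);
```

Keeping the handler free of side effects makes it easy to test; the caller then sends each returned event as JSON over the data channel.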

Slide 31

Slide 31 text

Tool/Function Calling DEMO

Slide 32

Slide 32 text

OpenAI Realtime API
https://openai.com/api/pricing/ (28.10.2025)

Slide 33

Slide 33 text

OpenAI Realtime API
https://community.openai.com/t/estimate-the-cost-for-1-min-usage-of-real-time-api/1019290/6

Slide 34

Slide 34 text

Local Realtime APIs DEMO

Slide 35

Slide 35 text

Local Realtime APIs
– Whisper (STT)
– Silero (VAD)
– SmolLM2-1.7B (LLM)
– Kokoro (TTS)

Slide 36

Slide 36 text

Summary
– Realtime models unlock new, exciting opportunities for natural language interfaces beyond chat boxes
– Bidirectional, multilingual, minimum latency
– Quality is good, but not perfect
– Pricing seems quite high
– Fun!
– No science fiction, try it today!

Slide 37

Slide 37 text

Thank you for your kind attention!
Christian Liebel
@christianliebel
christian.liebel@thinktecture.com