

“Hello, AI?!” - Real-time interactions with language models

Large Language Models (LLMs) have changed how we design software: instead of clicks and GUIs, natural language now dominates, not just via keyboard and text but also by voice. In this session, you will learn how to integrate voice-enabled AI models directly into your application and control them in real time with your voice. Thanks to Gemini Live and tool calling, new possibilities for natural language interfaces are emerging: bidirectional, multilingual, and with minimal latency.

In practical demos, Christian Liebel from Thinktecture will show you how to talk to selected LLMs by voice, connect your application's functionality to them, and develop smart, conversational interfaces. This is not science fiction.

Caution: Interactive voice AIs can be addictive.


Christian Liebel

July 11, 2025


Transcript

  1. “Hello, AI!?” Real-time interactions with language models

    Christian Liebel
    Hello, it’s me.
    – Google Developer Expert Angular & Web
    – W3C WebML CG & WG
    – X: @christianliebel
    – Bluesky: @christianliebel.com
    – Email: [email protected]
  2. Overview: Generative AI

    – Text: OpenAI GPT, Mistral, …
    – Audio/Music: Musico, Soundraw, …
    – Images: DALL·E, Firefly, …
    – Video: Sora, Runway, …
    – Speech: Whisper, tortoise-tts, …
  3. Overview: Generative AI

    – Text: OpenAI GPT, Mistral, …
    – Audio/Music: Musico, Soundraw, …
    – Images: DALL·E, Firefly, …
    – Video: Sora, Runway, …
    – Speech: Whisper, tortoise-tts, …
  4. Realtime Models

    – Process speech input and output natively (transcription optional)
    – Multiple languages and output voices are supported
    – Tool/function calling is supported
    – Voice Activity Detection (VAD) is activated automatically (the model waits for a period of silence before responding)
    – Model can be interrupted
    – Use cases: phone agents, ticket machines, alternative input methods for accessibility, and other speech-based user experiences
  5. Realtime Models

    Gemini Live API (Preview)
      Half-cascade (better for tools)
      – Gemini Live 2.5 Flash
      – Gemini 2.0 Flash Live 001
      Native audio dialog (reasoning)
      – Gemini 2.5 Flash
    OpenAI Realtime API (Beta)
      – GPT-4o Realtime
      – GPT-4o mini Realtime
  6. APIs

    Gemini Live API (Preview)
      – 40+ languages
      – Supports speech, text and video input
      – Supports speech and text output
      – Supports WebSockets
      – No JS SDK yet, integration is ~1300 LOC
    OpenAI Realtime API (Beta)
      – 57+ languages
      – Supports speech and text input
      – Supports speech and text output
      – Supports WebRTC and WebSockets
      – No JS SDK yet, WebRTC integration is ~50 LOC
  7. WebSockets

    – Bi-directional communication protocol based on TCP
    – Reduces overhead by eliminating repeated HTTP headers

    https://ai.google.dev/gemini-api/docs/live
  8. Media Capture & Streams API

    getUserMedia()
    – JavaScript APIs for accessing media devices
    – Captures video and/or audio input
    – W3C Candidate Recommendation
    – Supported by all major browsers for several years (Chrome 21, Edge 12, Safari 11, Firefox 17)
  9. Gemini Live Messages

    – Client → Server: setup (model, system message, modalities, voice configuration, tools)
    – Server → Client: setupComplete
    – Client → Server: realtimeInput
  10. Summary

    – Realtime models unlock new, exciting opportunities for natural language interfaces beyond chat boxes
    – Bidirectional, multilingual, minimum latency
    – All available models (OpenAI Realtime/Gemini Live) in beta or preview
    – Quality is good, but not perfect
    – Pricing seems quite high
    – Fun!
    – No science fiction, try it today!
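
The Gemini Live Messages slide names three messages: setup, setupComplete and realtimeInput. The following is a minimal sketch of how a client might build and recognize them as JSON. It is not taken from the talk: the nested field names (systemInstruction, generationConfig, responseModalities, speechConfig, functionDeclarations, the audio/pcm MIME type) are assumptions based on the public Live API documentation and should be checked against the current reference, and the helper names are hypothetical.

```typescript
// Hypothetical helper types; the field names mirror the Gemini Live API
// docs as far as I know them and are assumptions, not the speaker's code.
interface FunctionDeclaration {
  name: string;
  description: string;
  parameters?: Record<string, unknown>;
}

interface SetupOptions {
  model: string; // model identifier, passed through unchanged
  systemMessage: string;
  responseModalities: Array<"AUDIO" | "TEXT">;
  voiceName?: string;
  tools?: FunctionDeclaration[];
}

// Builds the first client message: model, system message, modalities,
// voice configuration and tools, as listed on the slide.
function buildSetupMessage(options: SetupOptions): { setup: Record<string, unknown> } {
  const generationConfig: Record<string, unknown> = {
    responseModalities: options.responseModalities,
  };
  if (options.voiceName !== undefined) {
    generationConfig["speechConfig"] = {
      voiceConfig: { prebuiltVoiceConfig: { voiceName: options.voiceName } },
    };
  }
  const setup: Record<string, unknown> = {
    model: options.model,
    systemInstruction: { parts: [{ text: options.systemMessage }] },
    generationConfig,
  };
  if (options.tools !== undefined && options.tools.length > 0) {
    setup["tools"] = [{ functionDeclarations: options.tools }];
  }
  return { setup };
}

// Wraps one base64-encoded chunk of raw PCM audio in a realtimeInput message.
// The 16000 Hz default is an assumption about the expected input format.
function buildRealtimeInput(base64Pcm: string, sampleRate = 16000) {
  return {
    realtimeInput: {
      audio: { data: base64Pcm, mimeType: `audio/pcm;rate=${sampleRate}` },
    },
  };
}

// Returns true if a server message (already decoded to a JSON string)
// is the setupComplete acknowledgement. Throws on invalid JSON.
function isSetupComplete(rawMessage: string): boolean {
  const parsed: unknown = JSON.parse(rawMessage);
  return typeof parsed === "object" && parsed !== null && "setupComplete" in parsed;
}
```

In a client, the result of buildSetupMessage would be serialized with JSON.stringify and sent as the first WebSocket message; audio chunks would only be sent once isSetupComplete has returned true for a server message.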
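
The Media Capture & Streams slide describes getUserMedia() as the way to capture audio input. As a rough sketch, the snippet below requests an audio-only stream and converts floating-point samples to 16-bit PCM, a format commonly expected by speech APIs. The MediaDevicesLike interface and both function names are hypothetical; the interface only exists so the sketch type-checks outside a browser, where navigator.mediaDevices would be passed in. The conversion to 16-bit PCM is an assumption about what the receiving API wants.

```typescript
// Minimal structural type for the part of navigator.mediaDevices used here.
// Hypothetical: introduced so the sketch does not depend on DOM typings.
interface MediaDevicesLike<TStream> {
  getUserMedia(constraints: { audio: boolean; video: boolean }): Promise<TStream>;
}

// Asks for an audio-only stream. In a browser this triggers the
// microphone permission prompt and rejects if the user declines.
async function captureMicrophone<TStream>(
  devices: MediaDevicesLike<TStream>,
): Promise<TStream> {
  return devices.getUserMedia({ audio: true, video: false });
}

// Converts Web Audio style samples in [-1, 1] to signed 16-bit PCM.
// Values outside the range are clamped before scaling.
function floatTo16BitPcm(samples: Float32Array): Int16Array {
  const pcm = new Int16Array(samples.length);
  samples.forEach((value, index) => {
    const clamped = Math.max(-1, Math.min(1, value));
    pcm[index] = clamped < 0
      ? Math.round(clamped * 0x8000)
      : Math.round(clamped * 0x7fff);
  });
  return pcm;
}
```

In a browser, the call would look like captureMicrophone(navigator.mediaDevices); the resulting stream still has to be routed through the Web Audio API (for example an AudioWorklet) to obtain the Float32Array sample frames that floatTo16BitPcm expects.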
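
The Realtime Models slide notes that Voice Activity Detection makes the model wait for a period of silence before responding. According to the slide, the realtime APIs do this automatically, so the following is only an illustration of the idea, not how the hosted models implement it. It is a deliberately simple energy-based sketch: the class name, the RMS threshold and the frame count are all hypothetical values chosen for the example.

```typescript
// Root mean square of one audio frame, a rough measure of loudness.
function rms(frame: Float32Array): number {
  if (frame.length === 0) return 0;
  let sumOfSquares = 0;
  frame.forEach((sample) => {
    sumOfSquares += sample * sample;
  });
  return Math.sqrt(sumOfSquares / frame.length);
}

// Hypothetical end-of-speech detector: reports the end of an utterance
// once speech has been heard and is followed by enough quiet frames.
class SilenceDetector {
  private readonly threshold: number;
  private readonly silentFramesToEnd: number;
  private heardSpeech = false;
  private silentFrames = 0;

  constructor(threshold = 0.02, silentFramesToEnd = 10) {
    this.threshold = threshold; // assumed loudness cutoff, needs tuning
    this.silentFramesToEnd = silentFramesToEnd;
  }

  // Feed one frame; returns true exactly once per utterance,
  // on the frame that completes the required period of silence.
  push(frame: Float32Array): boolean {
    if (rms(frame) >= this.threshold) {
      this.heardSpeech = true;
      this.silentFrames = 0;
      return false;
    }
    if (!this.heardSpeech) return false; // silence before any speech
    this.silentFrames += 1;
    if (this.silentFrames < this.silentFramesToEnd) return false;
    this.heardSpeech = false;
    this.silentFrames = 0;
    return true;
  }
}
```

Real detectors are usually model-based and more robust to background noise; the point here is only the "wait for silence after speech" behavior, which also explains why a short pause mid-sentence can trigger a response.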