Audio-First Voice Development: The good, the bad & the ugly

At NPR, our interest in voice-based interfaces is obvious; they're a natural fit for our content, which has always taken an audio-first approach. Yet our expectations of voice platforms' capabilities to serve top-notch audio-first experiences have not always meshed with reality. Most of these platforms were designed primarily as vehicles for Text-to-Speech (TTS) interactions, and the ability to play audio beyond the 120 seconds or so allowed by an SSML tag comes with a distinct set of challenges, ranging from hard platform limitations to implementation quirks, documentation oversights, suboptimal user experiences, and essential features that are still missing. While we've proven that it isn't impossible to build a great audio-first experience, the journey it took to get there was almost entirely uphill.

Join a developer from NPR as she discusses the good, the bad, and the ugly of audio-first development on the Amazon Alexa and Google Assistant platforms. Along the way, she'll share her vision of what an ideal audio-first developer platform might look like.

Nara Kasbergen

July 23, 2019


  1. Audio-First Voice Development: The good, the bad & the ugly
    Nara Kasbergen (@xiehan) | NPR | VOICE Summit | Tuesday, July 23
  2. Scenario: Live station streams ("Play NPR")
    ◎ Individual NPR member stations provide live streams
      ◉ mp3 or aac audio file format, sometimes uses .pls or .m3u
    ◎ This skill/action helps users find a station & listen to the stream
  3. Scenario: Live station streams (Alexa)
    ◎ Supports streaming audio
    ◎ Multiple file formats: mp3, aac, pls, m3u…
    ◎ PlaybackFailed event
    verdict: Good!
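
For illustration, here is a minimal sketch of the kind of response an Alexa skill can return to start a live stream: a single `AudioPlayer.Play` directive with `playBehavior` set to `REPLACE_ALL`. The helper name, stream URL, and token are placeholders rather than NPR's actual values, and the HTTPS check reflects Alexa's requirement that stream URLs be served over HTTPS.

```python
def build_play_response(stream_url: str, token: str, offset_ms: int = 0) -> dict:
    """Build a skill response that starts audio playback (hypothetical helper)."""
    if not stream_url.startswith("https://"):
        raise ValueError("stream URL must use HTTPS")
    return {
        "version": "1.0",
        "response": {
            "directives": [
                {
                    "type": "AudioPlayer.Play",
                    # REPLACE_ALL stops whatever is playing and starts this stream.
                    "playBehavior": "REPLACE_ALL",
                    "audioItem": {
                        "stream": {
                            "url": stream_url,
                            # Opaque identifier chosen by the skill; echoed back in later events.
                            "token": token,
                            "offsetInMilliseconds": offset_ms,
                        }
                    },
                }
            ],
            "shouldEndSession": True,
        },
    }
```

Calling `build_play_response("https://example.org/live.mp3", "station-123")` yields a dictionary that can be serialized with `json.dumps` and returned from the skill's endpoint.
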
  4. Scenario: Live station streams
    ◎ Actions on Google (AoG) doesn't support streaming audio
    ◎ Have to use a separate, non-public-access API called Media Actions
    verdict: Bad!
  5. Scenario: NPR One (continuous play)
    ◎ Audio in short segments (2-3 minutes), mixed with podcasts
      ◉ mp3 or aac audio file format
    ◎ Continuous playlist
    ◎ Users can pause, resume, skip, fast-forward, rewind, mark as interesting, ask what's playing
  6. Scenario: NPR One (continuous play, Alexa)
    ◎ Supports seamless autoplay
    ◎ Built-in intents for pause, resume, start over, skip
    ◎ Auto-resumes after TTS
    ◎ PlaybackFailed event
    verdict: Good!
  7. Scenario: NPR One (continuous play, Alexa)
    ◎ Skill session ends when audio playback starts
    ◎ No built-in intents for fast-forward, rewind, "what's playing?"
    ◎ No way to do heartbeats
    verdict: Bad!
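
Because Alexa offers no built-in intents for fast-forward or rewind, one possible workaround (an assumption here, not necessarily what NPR shipped) is a pair of custom intents that restart the current stream at an adjusted offset. The sketch below assumes the incoming request's `context.AudioPlayer` object carries the current `token` and `offsetInMilliseconds`; the function name and the `url_for_token` lookup are hypothetical.

```python
def seek_directive(event: dict, url_for_token: dict, delta_ms: int) -> dict:
    """Return an AudioPlayer.Play directive that jumps by delta_ms (hypothetical helper).

    A positive delta_ms fast-forwards; a negative one rewinds.
    """
    player = event["context"]["AudioPlayer"]
    token = player["token"]
    # Clamp at zero so a rewind never produces a negative offset.
    new_offset = max(0, player.get("offsetInMilliseconds", 0) + delta_ms)
    return {
        "type": "AudioPlayer.Play",
        "playBehavior": "REPLACE_ALL",
        "audioItem": {
            "stream": {
                # The skill must map the token back to a URL itself.
                "url": url_for_token[token],
                "token": token,
                "offsetInMilliseconds": new_offset,
            }
        },
    }
```

A hypothetical `RewindIntent` handler could call `seek_directive(event, urls, -30_000)` to jump back 30 seconds. The sketch does not guard against seeking past the end of the file, since the request does not report the track's duration.
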
  8. Scenario: NPR One (continuous play, Alexa)
    ◎ PlaybackNearlyFinished (PNF) event is misnamed
    ◎ Can only queue up audio in response to PNF event
    verdict: Ugly!!!
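
To make the queuing constraint concrete, here is a rough sketch of a handler for Alexa's `AudioPlayer.PlaybackNearlyFinished` request. As we understand the AudioPlayer interface, the event signals that the device is ready to accept the next stream, which can happen long before the current track is actually near its end. The handler enqueues the next item with `playBehavior` set to `ENQUEUE` and `expectedPreviousToken` set to the token of the track now playing. The playlist contents and helper names are hypothetical.

```python
from typing import Optional, Tuple

# Hypothetical playlist of (token, url) pairs; a real skill would fetch these.
PLAYLIST = [
    ("segment-1", "https://example.org/audio/segment-1.mp3"),
    ("segment-2", "https://example.org/audio/segment-2.mp3"),
]


def next_item(current_token: str) -> Optional[Tuple[str, str]]:
    """Return the playlist entry after current_token, or None at the end."""
    tokens = [token for token, _ in PLAYLIST]
    if current_token not in tokens:
        return None
    index = tokens.index(current_token) + 1
    return PLAYLIST[index] if index < len(PLAYLIST) else None


def handle_nearly_finished(event: dict) -> dict:
    """Enqueue the next track in response to PlaybackNearlyFinished."""
    request = event["request"]
    if request["type"] != "AudioPlayer.PlaybackNearlyFinished":
        raise ValueError("unexpected request type")
    directives = []
    upcoming = next_item(request["token"])
    if upcoming is not None:
        token, url = upcoming
        directives.append({
            "type": "AudioPlayer.Play",
            "playBehavior": "ENQUEUE",
            "audioItem": {
                "stream": {
                    "url": url,
                    "token": token,
                    # Must match the track currently playing, or the enqueue is ignored.
                    "expectedPreviousToken": request["token"],
                    "offsetInMilliseconds": 0,
                }
            },
        })
    # Responses to AudioPlayer requests carry directives only, no speech.
    return {"version": "1.0", "response": {"directives": directives}}
```
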
  9. Scenario: NPR One (continuous play, Google Assistant)
    ◎ Supports seamless autoplay*
    ◎ Built-in intents for pause, resume, start over
    ◎ Automatic play controls in the Assistant app, including fast-forward, rewind
    verdict: Good!
  10. Scenario: NPR One (continuous play, Google Assistant)
    ◎ Seamless autoplay requires a hack: empty mp3 in an SSML <audio> tag (half sec of silence)
    ◎ Only supports mp3 files
    ◎ Cannot implement rewind/fast-forward purely via voice
    verdict: Bad!
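
The slide does not spell out the hack. One plausible reading, offered here as an assumption, is that the spoken prompt accompanying each media response is replaced by an SSML `<audio>` element pointing at a half-second silent mp3, so nothing audible is spoken between tracks. The sketch below only builds that SSML string; the function name and the silence URL are hypothetical.

```python
from xml.sax.saxutils import quoteattr


def silent_prompt_ssml(silence_url: str) -> str:
    """Build an SSML prompt that plays a short silent clip instead of speech."""
    # quoteattr escapes XML-special characters and adds the surrounding quotes.
    return f"<speak><audio src={quoteattr(silence_url)}/></speak>"
```

For example, `silent_prompt_ssml("https://example.org/silence.mp3")` returns `<speak><audio src="https://example.org/silence.mp3"/></speak>`.
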
  11. Scenario: NPR One (continuous play, Google Assistant)
    ◎ No "playback failed" event
    ◎ Audio does not auto-resume after an interruption
    ◎ No way to start playing audio from a specific offset
    verdict: Ugly!!!
  12. Scenario: Program on-demand (launch TBA)
    ◎ One hour of a popular program played on-demand
    ◎ Design ask: when the hour-long audio is done playing, have the voice assistant say: "That's all for now. Would you like to listen to your station's live stream?"
    ???
  13. Scenario: Program on-demand
    ◎ Cannot have Alexa speak (much less ask a question) after a finite audio file is done playing (using the audio player)
    verdict: Bad!
  14. Scenario: Program on-demand
    ◎ Can have Google Assistant speak or even ask a question after audio is done playing
    verdict: Good!
  15. The "Promised Land" for audio
    ◎ Lifecycle events: audio start, stop, finished, nearly finished, failed
    ◎ Built-in intents for all play controls
    ◎ Audio auto-resumes after TTS
    ◎ Can speak after playing audio
    ◎ Heartbeat events for analytics
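
As a thought experiment, the wishlist above could translate into a developer-facing interface along these lines. Everything in this sketch is hypothetical: no existing platform exposes this API, and the event names, the `AudioEventBus` class, and the payload fields are invented purely to illustrate the idea of uniform lifecycle and heartbeat events.

```python
from enum import Enum
from typing import Callable, Dict, List


class AudioEvent(Enum):
    """Lifecycle events from the wishlist, plus a heartbeat (all hypothetical)."""
    STARTED = "started"
    STOPPED = "stopped"
    NEARLY_FINISHED = "nearly_finished"
    FINISHED = "finished"
    FAILED = "failed"
    HEARTBEAT = "heartbeat"


Handler = Callable[[dict], None]


class AudioEventBus:
    """Minimal dispatcher: register handlers per event, then emit payloads."""

    def __init__(self) -> None:
        self._handlers: Dict[AudioEvent, List[Handler]] = {e: [] for e in AudioEvent}

    def on(self, event: AudioEvent, handler: Handler) -> None:
        self._handlers[event].append(handler)

    def emit(self, event: AudioEvent, payload: dict) -> None:
        for handler in self._handlers[event]:
            handler(payload)


# Example: total up listening time for analytics from heartbeat events.
listened = {"seconds": 0}


def record_heartbeat(payload: dict) -> None:
    listened["seconds"] += payload["interval_seconds"]


bus = AudioEventBus()
bus.on(AudioEvent.HEARTBEAT, record_heartbeat)
bus.emit(AudioEvent.HEARTBEAT, {"token": "segment-1", "interval_seconds": 15})
bus.emit(AudioEvent.HEARTBEAT, {"token": "segment-1", "interval_seconds": 15})
```

With heartbeats arriving at a known interval, listening time becomes a simple sum, and a `FINISHED` handler would be the natural place to speak a follow-up prompt such as the one in the program on-demand scenario.
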