Slide 1

Slide 1 text

Programming Java By Voice Breandan Considine EclipseCon 2016

Slide 2

Slide 2 text

Automatic speech recognition in 2011

Slide 3

Slide 3 text

No content

Slide 4

Slide 4 text

Traditional Automatic Speech Recognition (ASR)
• Requires lots of handmade feature engineering
• Poor results: >25% WER for HMM architectures

Slide 5

Slide 5 text

Xuedong Huang, James Baker, and Raj Reddy. A historical perspective of speech recognition. Commun. ACM, 57(1):94–103, January 2014.

Slide 6

Slide 6 text

What happened?
• Bigger data
• Faster hardware
• Smarter algorithms

Slide 7

Slide 7 text

State of the art ASR
• <10% average word error on large datasets
• Deep neural nets: RNNs, CNNs, RBMs, LSTMs
• Trained on 1,000+ hours of transcribed speech
• Takes time (days) and energy (kWh) to train
• Difficult to adapt without prior experience
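Word error rate can be made concrete: it is the word-level edit distance (substitutions, insertions, deletions) between a reference transcript and the recognizer's hypothesis, divided by the number of reference words. A minimal sketch (the class name and example sentences are illustrative, not from any library):

```java
public class WordErrorRate {
    // Word-level Levenshtein distance: substitutions + insertions + deletions
    static int editDistance(String[] ref, String[] hyp) {
        int[][] d = new int[ref.length + 1][hyp.length + 1];
        for (int i = 0; i <= ref.length; i++) d[i][0] = i;
        for (int j = 0; j <= hyp.length; j++) d[0][j] = j;
        for (int i = 1; i <= ref.length; i++)
            for (int j = 1; j <= hyp.length; j++) {
                int sub = ref[i - 1].equals(hyp[j - 1]) ? 0 : 1;
                d[i][j] = Math.min(d[i - 1][j - 1] + sub,
                                   Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1));
            }
        return d[ref.length][hyp.length];
    }

    // WER = edit distance / number of reference words
    static double wer(String reference, String hypothesis) {
        String[] ref = reference.split("\\s+");
        String[] hyp = hypothesis.split("\\s+");
        return (double) editDistance(ref, hyp) / ref.length;
    }

    public static void main(String[] args) {
        // one substitution in a five-word reference: WER = 0.2
        System.out.println(wer("please recognize speech for me",
                               "please recognize peach for me"));
    }
}
```

So "<10% average word error" means fewer than one word in ten is substituted, dropped, or hallucinated relative to the reference transcript.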

Slide 8

Slide 8 text

FOSS Speech Recognition
• Deep learning libraries
  • C/C++: Caffe, Kaldi
  • Python: Theano, Caffe
  • Lua: Torch
  • Java: dl4j, H2O
• Open source datasets
  • LibriSpeech – 1,000 hours of LibriVox audiobooks
• Experience is required

Slide 9

Slide 9 text

Let’s think for a moment…
• What if speech recognition were perfect?
• ASR is just a fancy input method
• How can ASR improve user productivity?
• What are the user’s expectations?
  • Behavior is predictable and deterministic
  • Control interface is simple and intuitive
  • Recognition is fast and accurate

Slide 10

Slide 10 text

Online Speech Recognition
• Google, Nuance, AT&T, WIT.ai/Facebook, IBM Watson

curl -X POST \
  --header 'Content-Type: audio/x-flac; rate=44100;' \
  --data-binary @speech.flac \
  'https://www.google.com/speech-api/v2/recognize?lang=en-us&key='

Slide 11

Slide 11 text

Why offline?
• Latency – many applications need fast local recognition
• Mobility – users do not always have an internet connection
• Privacy – data is recorded and analyzed completely offline
• Flexibility – configurable API, language, vocabulary, grammar

Slide 12

Slide 12 text

Introduction
• What techniques do modern ASR systems use?
• How do I build a speech recognition application?
• Is speech recognition accessible to developers?
• What libraries and frameworks exist for speech?

Slide 13

Slide 13 text

No content

Slide 14

Slide 14 text

Maven Dependencies

<dependency>
  <groupId>edu.cmu.sphinx</groupId>
  <artifactId>sphinx4-core</artifactId>
  <version>1.0-SNAPSHOT</version>
</dependency>
<dependency>
  <groupId>edu.cmu.sphinx</groupId>
  <artifactId>sphinx4-data</artifactId>
  <version>1.0-SNAPSHOT</version>
</dependency>

Slide 15

Slide 15 text

• Recording at 16 kHz, 16-bit depth, mono (single channel)
• 16,000 samples per second at 16-bit depth = 32,000 bytes/s
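That arithmetic can be checked with the standard javax.sound.sampled API; the AudioFormat below mirrors the recording settings above (the class name is illustrative):

```java
import javax.sound.sampled.AudioFormat;

public class RecordingFormat {
    // 16 kHz, 16-bit, mono, signed PCM, little-endian
    static final AudioFormat FORMAT = new AudioFormat(16000f, 16, 1, true, false);

    static int bytesPerSecond() {
        // frame size = 1 channel * 16 bits / 8 = 2 bytes; frame rate = 16,000 frames/s
        return (int) (FORMAT.getFrameRate() * FORMAT.getFrameSize());
    }

    public static void main(String[] args) {
        System.out.println(bytesPerSecond() + " bytes per second"); // 32000 bytes per second
    }
}
```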

Slide 16

Slide 16 text

• Typically 13 features per frame (MFCC or PLP)
• Plus delta and delta-delta features
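Appending the delta and delta-delta coefficients to the 13 static features triples the size, giving the 39-dimensional frame vector commonly used in HMM acoustic models. A trivial sketch of that arithmetic (class name is illustrative):

```java
public class FeatureVector {
    static final int CEPSTRAL = 13; // MFCC (or PLP) coefficients per frame

    static int dimension() {
        int delta = CEPSTRAL;      // first-order differences between frames
        int deltaDelta = CEPSTRAL; // second-order differences
        return CEPSTRAL + delta + deltaDelta;
    }

    public static void main(String[] args) {
        System.out.println(dimension()); // 39
    }
}
```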

Slide 17

Slide 17 text

No content

Slide 18

Slide 18 text

No content

Slide 19

Slide 19 text

No content

Slide 20

Slide 20 text

No content

Slide 21

Slide 21 text

No content

Slide 22

Slide 22 text

Step 1. Acoustic Model
• Acoustic model training is very time consuming (months)
• Pretrained models are available for many languages

config.setAcousticModelPath("resource:");

Slide 23

Slide 23 text

Step 2. Phonetic Model
• Mapping phonemes to words
• Word error rate increases with size
• Pronunciation aided by g2p labeling
• CMU Sphinx has tools to generate dictionaries

config.setDictionaryPath("resource:.dict");

Slide 24

Slide 24 text

Step 2. Phonetic Model

autonomous    AO T AA N AH M AH S
autonomously  AO T AA N OW M AH S L IY
autonomy      AO T AA N AH M IY
autonomy(2)   AH T AA N AH M IY
autopacific   AO T OW P AH S IH F IH K
autopart      AO T OW P AA R T
autoparts     AO T OW P AA R T S
autopilot     AO T OW P AY L AH T
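Entries like autonomy(2) above follow a simple convention: the headword, an optional parenthesized variant number for alternate pronunciations, then the ARPAbet phoneme sequence. A small parsing sketch (the class and parsing rules are my own, not part of Sphinx-4):

```java
public class DictionaryEntry {
    final String word;       // dictionary headword, without the variant marker
    final int variant;       // 1 for the default pronunciation, 2+ for alternates
    final String[] phonemes; // ARPAbet phoneme sequence

    DictionaryEntry(String line) {
        String[] tokens = line.trim().split("\\s+");
        String head = tokens[0];
        // alternate pronunciations are written as word(2), word(3), ...
        if (head.endsWith(")")) {
            int paren = head.indexOf('(');
            word = head.substring(0, paren);
            variant = Integer.parseInt(head.substring(paren + 1, head.length() - 1));
        } else {
            word = head;
            variant = 1;
        }
        phonemes = new String[tokens.length - 1];
        System.arraycopy(tokens, 1, phonemes, 0, tokens.length - 1);
    }

    public static void main(String[] args) {
        DictionaryEntry e = new DictionaryEntry("autonomy(2) AH T AA N AH M IY");
        System.out.println(e.word + " #" + e.variant + ": " + String.join(" ", e.phonemes));
    }
}
```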

Slide 25

Slide 25 text

3a. Language Model
• Need ~100k sentences
• Some tools:
  • Logios (model generation)
  • lmtool (CMU Sphinx)
  • IRSTLM
  • MITLM
• Appropriate for transcription, voice typing

3b. Grammar Model
• More rigid structure
• Suitable for commands
• Much smaller state space
• Competitive with DNN accuracy for small vocabularies
• Easy to configure for UX

Slide 26

Slide 26 text

Step 3a. Language Model

generally cloudy today with scattered outbreaks of rain and drizzle persistent and heavy at times some dry intervals also with hazy sunshine especially in eastern parts in the morning highest temperatures nine to thirteen Celsius in a light or moderate mainly east south east breeze

cloudy damp and misty today with spells of rain and drizzle in most places much of this rain will be light and patchy but heavier rain may develop in the west later
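An n-gram language model is estimated from counts over many such sentences. Tools like lmtool build a full ARPA backoff model with smoothing, but the underlying counting step can be sketched as follows (class name and corpus string are illustrative):

```java
import java.util.HashMap;
import java.util.Map;

public class BigramCounts {
    // Count adjacent word pairs; an n-gram model estimates
    // P(word | previous word) from such counts, with smoothing and backoff.
    static Map<String, Integer> count(String corpus) {
        String[] words = corpus.split("\\s+");
        Map<String, Integer> bigrams = new HashMap<>();
        for (int i = 1; i < words.length; i++)
            bigrams.merge(words[i - 1] + " " + words[i], 1, Integer::sum);
        return bigrams;
    }

    public static void main(String[] args) {
        Map<String, Integer> c =
            count("spells of rain and drizzle in most places much of this rain");
        System.out.println(c.get("of rain")); // 1
    }
}
```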

Slide 27

Slide 27 text

No content

Slide 28

Slide 28 text

No content

Slide 29

Slide 29 text

Step 3b. Grammar Model
• JSpeech Grammar Format

<size>   = /10/ small | /2/ medium | /1/ large;
<color>  = /0.5/ red | /0.1/ blue | /0.2/ green;
<action> = please (/20/ save files | /1/ delete files);
<object> = /20/ <size> | /5/ <color>;
public <command> = <size> | <color> | <action> | <object>;

config.setGrammarPath("resource:.gram");

Slide 30

Slide 30 text

Step 3b: Grammar Format

public <number> = <hundreds> | <tens> | <teens> | <digits>;
<hundreds> = <digits> hundred [ <tens> | <teens> | <digits> ];
<tens> = ( twenty | thirty | forty | fifty | sixty | seventy | eighty | ninety ) [ <digits> ];
<teens> = ten | eleven | twelve | thirteen | fourteen | fifteen | sixteen | seventeen | eighteen | nineteen;
<digits> = one | two | three | four | five | six | seven | eight | nine;

Slide 31

Slide 31 text

Configuring Sphinx-4

Configuration config = new Configuration();
config.setAcousticModelPath(AM_PATH);
config.setDictionaryPath(DICT_PATH);
config.setLanguageModelPath(LM_PATH);
config.setGrammarPath(GRAMMAR_PATH);
// config.setSampleRate(8000);

Slide 32

Slide 32 text

Live Speech Recognizer

LiveSpeechRecognizer recognizer =
    new LiveSpeechRecognizer(config);
recognizer.startRecognition(true);
…
recognizer.stopRecognition();

Slide 33

Slide 33 text

Live Speech Recognizer

while (…) {
  // This blocks on a recognition result
  SpeechResult sr = recognizer.getResult();
  String h = sr.getHypothesis();
  Collection<String> hs = sr.getNbest(3);
  …
}

Slide 34

Slide 34 text

Stream Speech Recognizer

StreamSpeechRecognizer recognizer =
    new StreamSpeechRecognizer(config);
recognizer.startRecognition(
    new FileInputStream("speech.wav"));
SpeechResult result = recognizer.getResult();
recognizer.stopRecognition();

Slide 35

Slide 35 text

Improving recognition accuracy
• Using context-dependent cues
• Structuring commands to reduce phonetic similarity
• Disabling the microphone
• Grammar swapping

Slide 36

Slide 36 text

No content

Slide 37

Slide 37 text

public <command> = override methods
                 | implement methods
                 | delegate methods
                 | generate
                 | surround with
                 | unwrap
                 | comment
                 | …

Slide 38

Slide 38 text

Grammar Swapping

public static void swapGrammar(String newGrammarName)
    throws PropertyException, InstantiationException, IOException {
  Linguist linguist = (Linguist) cm.lookup("flatLinguist");
  linguist.deallocate();
  cm.setProperty("jsgfGrammar", "grammarName", newGrammarName);
  linguist.allocate();
}

Slide 39

Slide 39 text

Step 4. Audio User Interface
• Important mechanism for accessibility
• Communication via text-to-speech and audio feedback
• Short cues prompt an action and announce a result
• Provides a familiar feedback mechanism for users
• Playing audio usually blocks speech recognition

Slide 40

Slide 40 text

MaryTTS: Initializing

maryTTS = new LocalMaryInterface();
Locale systemLocale = Locale.getDefault();
if (maryTTS.getAvailableLocales()
           .contains(systemLocale)) {
  voice = Voice.getDefaultVoice(systemLocale);
}
maryTTS.setLocale(voice.getLocale());
maryTTS.setVoice(voice.getName());

Slide 41

Slide 41 text

MaryTTS: Generating Speech

try {
  AudioInputStream audio = maryTTS.generateAudio(text);
  AudioPlayer player = new AudioPlayer(audio);
  player.start();
  player.join();
} catch (SynthesisException | InterruptedException e) {
  …
}

Slide 42

Slide 42 text

Resources
• CMUSphinx, http://cmusphinx.sourceforge.net/wiki/
• MaryTTS, http://mary.dfki.de/
• JSpeech Grammar Format, http://www.w3.org/TR/jsgf/
• LibriSpeech ASR Corpus, http://www.openslr.org/12/
• ARPA format for N-gram backoff (Doug Paul), http://www.speech.sri.com/projects/srilm/manpages/ngram-format.5.html
• LM Tool, http://www.speech.cs.cmu.edu/tools/lmtool.html

Slide 43

Slide 43 text

Further Research
• Accurate and Compact Large Vocabulary Speech Recognition on Mobile Devices, research.google.com/pubs/archive/41176.pdf
• Comparing Open-Source Speech Recognition Toolkits, http://suendermann.com/su/pdf/oasis2014.pdf
• Tuning Sphinx to Outperform Google's Speech Recognition API, http://suendermann.com/su/pdf/essv2014.pdf
• Deep Neural Networks for Acoustic Modeling in Speech Recognition, research.google.com/pubs/archive/38131.pdf
• Deep Speech: Scaling up end-to-end speech recognition, http://arxiv.org/pdf/1412.5567v2.pdf

Slide 44

Slide 44 text

Online Resources
• WER progress: https://github.com/syhw/wer_are_we
• Kaldi Speech Recognition Library, http://kaldi-asr.org/doc/
• J.A.R.V.I.S (API), https://github.com/lkuza2/java-speech-api
• OpenEars, http://www.politepix.com/openears
• PocketSphinx, https://github.com/cmusphinx/pocketsphinx
• AT&T Speech, http://developer.att.com/apis/speech/docs
• https://www.chromium.org/developers/how-tos/api-keys

Slide 45

Slide 45 text

Special Thanks
https://github.com/breandan/idear
• Alexey Kudinkin (@alexeykudinkin)
• Yaroslav Lepenkin (@lepenkinya)
• CMU Sphinx (@cmuspeechgroup)

Slide 46

Slide 46 text

Evaluate the Sessions

Sign in and vote at eclipsecon.org (+1 / 0 / −1)