
Neural networks for segmentation of vocalizations

PyData NYC 2017 talk

David Nicholson

November 27, 2017


Transcript

  1. Neural networks for segmentation of vocalizations
      David Nicholson, Ph.D., Sober lab, Emory University (nicholdav / nickledave)
      Yarden Cohen, Ph.D., Gardner lab, Boston University (YardenJCohen / yardencsgithub)
  2. Introduction
      State of the art for speech recognition: LSTM networks
      ◦ Specifically, bidirectional LSTMs
      https://awni.github.io/speech-recognition/
  3. Introduction
      But there are many cases where we care about segmentation
      ◦ Diagnosis of speech disorders, e.g., stuttering
      ◦ Understanding how the brain controls speech
        ◦ at the level of muscles
        ◦ at the level of phonemes
      ◦ As a feature that may make ASR more robust to
        ◦ background noise
        ◦ different accents
      Büchel C, Sommer M (2004) What Causes Stuttering? PLOS Biology
  4. Introduction
      Big picture: segmentation = finding signal in noise
      ◦ Object recognition
      ◦ Image segmentation
      ◦ Finding piano keys in music (S. Böck and M. Schedl 2012)
      ◦ Finding events embedded in background noise (Parascandolo, Huttunen, and Virtanen 2016)
      ◦ Finding elements of birdsong (Koumura and Okanoya 2016)
  5. Introduction
      Birdsong: an MNIST for benchmarking segmentation architectures?
      ◦ Bengalese Finch song repository: https://figshare.com/articles/Bengalese_Finch_song_repository/4805749
  6. Introduction
      Songbirds: a model system for understanding how the brain learns speech and similar motor skills
      ◦ learn their vocalizations by social interaction, from a tutor
      http://songbirdscience.com/
      Photo: Jon Sakata. Spectrogram: Dooling lab
  7. Introduction
      Segmenting birdsong
      ◦ Songbird “syllables”
        ◦ like phonemes in human speech
      ◦ In many species, song syllables separated by brief silences
  8. Introduction
      Segmenting birdsong
      ◦ Songbird “syllables”
        ◦ like phonemes in human speech
      ◦ In many species, song syllables separated by brief silences
        → easier to segment than speech, a good test case (see the threshold sketch below)
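To make that point concrete, here is a minimal sketch of amplitude-threshold segmentation, the kind of approach that brief inter-syllable silences make possible. It assumes NumPy/SciPy and a mono waveform; the function name segment_syllables, the filter cutoff, the threshold, and the minimum-silence gap are illustrative choices, not the settings used by hybrid-vocal-classifier or in the talk.

```python
# Sketch: segment syllables by thresholding a smoothed amplitude envelope.
# Parameter values here are illustrative, not the talk's actual settings.
import numpy as np
from scipy.signal import butter, filtfilt

def segment_syllables(song, samp_freq, threshold=0.01, min_silence_s=0.005):
    """Return a list of (onset, offset) times in seconds where the smoothed
    amplitude envelope exceeds `threshold`."""
    song = np.asarray(song, dtype=float)
    # square the signal and low-pass filter it to get a smooth amplitude envelope
    b, a = butter(4, 50.0 / (samp_freq / 2.0), btype="low")
    envelope = filtfilt(b, a, song ** 2)
    above = envelope > threshold
    crossings = np.diff(above.astype(int))
    onsets = (np.nonzero(crossings == 1)[0] + 1) / samp_freq    # False -> True
    offsets = (np.nonzero(crossings == -1)[0] + 1) / samp_freq  # True -> False
    # assumes the recording starts and ends in silence, so onsets/offsets pair up
    segments = []
    for on, off in zip(onsets, offsets):
        if segments and on - segments[-1][1] < min_silence_s:
            # silent gap shorter than min_silence_s: merge with the previous segment
            segments[-1] = (segments[-1][0], off)
        else:
            segments.append((on, off))
    return segments
```

With a birdsong recording sampled at, say, 32 kHz, one might call segment_syllables(song, 32000) and then hand each segment to a classifier; the failure modes listed later (noise, changed song, hard-to-segment species) are exactly where this thresholding breaks down.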
  9. Introduction
      Segmenting birdsong
      ◦ Songs can vary between species as much as phonemes and syntax vary between languages
      [spectrogram: zebra finch song, 1 s scale bar]
  10. Introduction
      Segmenting birdsong
      ◦ Songs can vary between species as much as phonemes and syntax vary between languages
      [spectrograms: zebra finch and Bengalese finch songs, 1 s scale bars]
  11. Introduction
      Segmenting birdsong
      ◦ Songs can vary between species as much as phonemes and syntax vary between languages
      [spectrograms: zebra finch and canary songs, 1 s scale bars]
  12. Introduction
      Segmenting birdsong
      ◦ Songs can vary between species as much as phonemes and syntax vary between languages
      ◦ Allows us to test how well models generalize across different “languages”
      [spectrograms: zebra finch and canary songs, 1 s scale bars]
  13.–15. Introduction
      Why do we care about segmenting birdsong?
      ◦ Use case: relating neural activity to song syllables
  16. Introduction
      The problem: segmentation
      ◦ Supervised learning methods can accurately classify segments*
      ◦ But they fail when segmentation itself fails, due to
        ◦ noise
        ◦ changes in song caused by an experiment
        ◦ a bird whose song is not easily segmented, e.g., canary song
      *hybrid-vocal-classifier.readthedocs.io
  17. Introduction
      The problem: segmentation
      ◦ Supervised learning methods can accurately classify segments*
      ◦ But they fail when segmentation itself fails, due to
        ◦ noise
        ◦ changes in song caused by an experiment
        ◦ a bird whose song is not easily segmented, e.g., canary song
      The solution? Neural networks, of course
      *hybrid-vocal-classifier.readthedocs.io
  18. Methods
      But which neural network?
      ◦ Compare neural networks
      ◦ Requirements:
        ◦ find edges of syllables (segment)
        ◦ label syllables (annotate)
      ◦ Training data:
        ◦ use spectrograms as input
        ◦ annotate all time steps (see the data-preparation sketch below)
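As a rough illustration of "spectrograms as input, annotate all time steps", here is a sketch that turns one annotated song file into a (spectrogram, per-time-bin labels) pair. The STFT parameters and the label scheme (0 = silence, 1..N = syllable classes) are assumptions for illustration, not the exact preprocessing used in the talk's repositories.

```python
# Sketch: build one training example (spectrogram + a label for every time bin)
# from a song waveform and its annotated syllable onsets, offsets, and labels.
import numpy as np
from scipy.signal import spectrogram

def make_training_example(song, samp_freq, onsets_s, offsets_s, labels,
                          nperseg=512, noverlap=480):
    """Return (spect, frame_labels): a log spectrogram with shape
    (freq_bins, time_bins) and one integer label per time bin."""
    freqs, times, spect = spectrogram(song, fs=samp_freq,
                                      nperseg=nperseg, noverlap=noverlap)
    spect = np.log(spect + np.finfo(spect.dtype).eps)    # log scale; eps avoids log(0)
    frame_labels = np.zeros(times.shape, dtype=np.int64)  # 0 = silence / background
    for on, off, lbl in zip(onsets_s, offsets_s, labels):
        # every time bin falling inside an annotated syllable gets that syllable's label
        frame_labels[(times >= on) & (times < off)] = lbl
    return spect, frame_labels
```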
  19. Methods
      Neural network types
      ◦ Fully convolutional network: FCN
        ◦ input is spectrogram
        ◦ convolutional layers + max pooling layers
  20. Methods
      Neural network types
      ◦ Fully convolutional network: FCN
        ◦ input is spectrogram
        ◦ convolutional layers + max pooling layers
        ◦ output is spectrogram with every point labeled
      (Long, Shelhamer, and Darrell 2015)
  21. Methods
      Neural network types
      ◦ Fully convolutional network: FCN
        ◦ input is spectrogram
        ◦ convolutional layers + max pooling layers
        ◦ output is spectrogram with every point labeled (see the FCN sketch below)
      (Koumura and Okanoya 2016)
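A minimal Keras-style sketch of the FCN idea: convolution and max pooling downsample the spectrogram, transposed convolutions upsample it back, and a final 1×1 convolution labels every time-frequency point, in the spirit of Long, Shelhamer, and Darrell 2015. The layer sizes, NUM_CLASSES, and input shape are illustrative assumptions, not the architecture from the fcn-syl-seg repository.

```python
# Sketch of a fully convolutional network (FCN) over spectrograms:
# downsample with conv + max pooling, upsample with transposed convs,
# then label every time-frequency point with a softmax over classes.
from tensorflow import keras
from tensorflow.keras import layers

NUM_CLASSES = 11                   # e.g., 10 syllable classes + silence (assumed)
FREQ_BINS, TIME_BINS = 256, 512    # assumed spectrogram window size

inputs = keras.Input(shape=(FREQ_BINS, TIME_BINS, 1))   # spectrogram as a 1-channel image
x = layers.Conv2D(32, 3, padding="same", activation="relu")(inputs)
x = layers.MaxPooling2D(2)(x)                           # downsample by 2
x = layers.Conv2D(64, 3, padding="same", activation="relu")(x)
x = layers.MaxPooling2D(2)(x)                           # downsample by 4 in total
x = layers.Conv2D(64, 3, padding="same", activation="relu")(x)
x = layers.Conv2DTranspose(32, 3, strides=2, padding="same", activation="relu")(x)
x = layers.Conv2DTranspose(32, 3, strides=2, padding="same", activation="relu")(x)
outputs = layers.Conv2D(NUM_CLASSES, 1, activation="softmax")(x)  # label every point

fcn = keras.Model(inputs, outputs)
fcn.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
            metrics=["accuracy"])
```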
  22. Methods
      Neural network types:
      ◦ Convolutional + bidirectional LSTM: CNN-biLSTM
        ◦ Input layer is spectrogram
        ◦ CNN + max pooling
  23.–24. Methods
      Neural network types:
      ◦ Convolutional + bidirectional LSTM: CNN-biLSTM
        ◦ Input layer is spectrogram
        ◦ CNN + max pooling
        ◦ Bi-directional LSTM (see the CNN-biLSTM sketch below)
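A minimal Keras-style sketch of the CNN-biLSTM idea: convolution plus max pooling (here pooling only along the frequency axis, so the time axis keeps its length) feed a bidirectional LSTM that outputs a label for every time step. Layer sizes, NUM_CLASSES, and the input shape are illustrative assumptions, not the exact architecture in tf_syllable_segmentation_annotation.

```python
# Sketch of a CNN-biLSTM for framewise labeling of spectrograms:
# conv + max pooling extract local features, a bidirectional LSTM labels every time step.
from tensorflow import keras
from tensorflow.keras import layers

NUM_CLASSES = 11
FREQ_BINS, TIME_BINS = 256, 512

inputs = keras.Input(shape=(FREQ_BINS, TIME_BINS, 1))    # spectrogram, 1 channel
x = layers.Conv2D(32, 3, padding="same", activation="relu")(inputs)
x = layers.MaxPooling2D(pool_size=(2, 1))(x)             # pool only along frequency,
x = layers.Conv2D(64, 3, padding="same", activation="relu")(x)
x = layers.MaxPooling2D(pool_size=(2, 1))(x)             # so the time axis keeps its length
# reshape to (time_steps, features): one feature vector per spectrogram time bin
x = layers.Permute((2, 1, 3))(x)                          # -> (time, freq, channels)
x = layers.Reshape((TIME_BINS, (FREQ_BINS // 4) * 64))(x)
x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(x)
outputs = layers.TimeDistributed(layers.Dense(NUM_CLASSES, activation="softmax"))(x)

cnn_bilstm = keras.Model(inputs, outputs)
cnn_bilstm.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                   metrics=["accuracy"])
```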
  25. Methods
      To compare the networks:
      ◦ “Learning curve”: train with training sets of increasing size
        ◦ What’s the best we can do with the least amount of data?
        ◦ Replicate each training-set size with a random grab of song files
      ◦ Measure: frame-wise accuracy (see the sketch below)
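A sketch of that comparison loop and of frame-wise accuracy (the fraction of spectrogram time bins labeled correctly). The callables train_fn and predict_fn stand in for training and inference with either the FCN or the CNN-biLSTM; they, and the default number of replicates, are hypothetical placeholders rather than the talk's actual code.

```python
# Sketch of the "learning curve": for each training-set size, repeatedly draw a
# random subset of song files, train a network, and measure frame-wise accuracy
# on a held-out test set.
import numpy as np

def framewise_accuracy(y_true, y_pred):
    """Fraction of spectrogram time bins whose predicted label matches ground truth."""
    y_true, y_pred = np.concatenate(y_true), np.concatenate(y_pred)
    return np.mean(y_true == y_pred)

def learning_curve(train_files, test_files, test_labels, train_set_sizes,
                   train_fn, predict_fn, n_replicates=5, seed=0):
    """train_fn(list_of_files) -> model; predict_fn(model, file) -> 1-D array of
    frame labels. Both are hypothetical callables supplied by the caller."""
    rng = np.random.default_rng(seed)
    results = {}   # training-set size -> list of accuracies, one per replicate
    for size in train_set_sizes:
        accs = []
        for _ in range(n_replicates):
            # random grab of song files for this replicate
            subset = rng.choice(train_files, size=size, replace=False)
            model = train_fn(list(subset))
            y_pred = [predict_fn(model, f) for f in test_files]
            accs.append(framewise_accuracy(test_labels, y_pred))
        results[size] = accs
    return results
```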
  26. Results
      CNN-biLSTM error can improve with some changes in hyperparameters
      ◦ 400-second training set
      ◦ 152 → 513 frequency bins
      ◦ 128 → 512 hidden units
      [figure: Bengalese finch test set error histograms (30 training files, 400 s, ~900 test files)]
  27. Results
      CNN-biLSTM can achieve almost zero error on a very stereotyped song, as would be expected
      [figure: zebra finch test set error histograms (14 training files, ~20 s, ~150 test files)]
  28. Discussion
      ◦ Initial results suggest CNN-biLSTM outperforms FCN for segmentation
      ◦ Accuracy of CNN-biLSTM can improve further by changing hyperparameters
      ◦ CNN-biLSTM performs nearly perfectly when segmenting syllables from a species with a very stereotyped song
  29. Discussion
      Future work
      ◦ Further study feedforward vs. recurrent architectures
      ◦ Consider other architectures already developed
        ◦ “sliding window” convolutional network
          ◦ Koumura and Okanoya 2016: https://github.com/cycentum/birdsong-recognition
          ◦ similar architecture in Keras: https://github.com/kylerbrown/deepchirp
      ◦ Applications
        ◦ automated segmentation of song
        ◦ speech disorder diagnosis, improved ASR
  30. Acknowledgments
      Gardner and Sober labs
      ◦ http://people.bu.edu/timothyg/Home.html
      ◦ https://scholarblogs.emory.edu/soberlab/
      Funding sources
      ◦ NIH
      ◦ NSF
      ◦ NVIDIA GPU grant program
      Fork us on GitHub
      ◦ https://github.com/yardencsGitHub/tf_syllable_segmentation_annotation
      ◦ https://github.com/NickleDave/tf_syllable_segmentation_annotation
      ◦ https://github.com/NickleDave/fcn-syl-seg