Neural networks for segmentation of vocalizations

David Nicholson, Ph.D.
Sober lab, Emory University
nicholdav nickledave

Yarden Cohen, Ph.D.
Gardner lab, Boston University
YardenJCohen yardencsgithub
Slide 2
Introduction
Automatic Speech Recognition (ASR) / Speech to Text
Potter et al. 1947, Visible Speech
Slide 4
Introduction
State of the art: LSTM networks
◦ Specifically bidirectional LSTMs
https://awni.github.io/speech-recognition/
Slide 5
Introduction
The problem: segmentation
◦ State-of-the-art networks explicitly avoid segmentation
Graves et al. 2005
Slide 9
Introduction
But there are many cases where we care about segmentation
◦ Diagnosis of speech disorders, e.g., stuttering
◦ Understanding how the brain controls speech
  ◦ At the level of muscles
  ◦ At the level of phonemes
◦ As a feature that may make ASR more robust to
  ◦ Background noise
  ◦ Different accents
Büchel C, Sommer M (2004) What Causes Stuttering? PLOS Biology
Slide 10
Introduction
Big picture: segmentation = finding signal in noise
◦ Object recognition
◦ Image segmentation
◦ Finding piano notes in music (Böck and Schedl 2012)
◦ Finding sound events embedded in background noise (Parascandolo, Huttunen, and Virtanen 2016)
◦ Finding elements of birdsong (Koumura and Okanoya 2016)
Slide 11
Introduction
Birdsong: an MNIST for benchmarking segmentation architectures?
◦ Bengalese Finch song repository:
https://figshare.com/articles/Bengalese_Finch_song_repository/4805749
Slide 12
Introduction
Songbirds: a model system for understanding how the brain learns speech and similar motor skills
◦ They learn their vocalizations by social interaction, from a tutor
http://songbirdscience.com/
Photo: Jon Sakata. Spectrogram: Dooling lab
Slide 15
Introduction
Segmenting birdsong
◦ Songbird “syllables”
  ◦ like phonemes in human speech
◦ In many species, song syllables are separated by brief silences
  ◦ easier to segment than speech, so a good test case
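The slides do not say how the silences between syllables are used for segmentation. A common baseline is to threshold an amplitude envelope, and the minimal sketch below assumes that approach. The function name `segment_by_threshold` and the toy amplitude values are hypothetical, chosen only for illustration.

```python
def segment_by_threshold(amplitude, threshold):
    """Return (onset, offset) index pairs for runs where amplitude > threshold.

    Offsets are exclusive. `amplitude` is a sequence with one value per time bin.
    This is an illustrative baseline, not the method used in the talk.
    """
    segments = []
    onset = None
    for i, value in enumerate(amplitude):
        if value > threshold and onset is None:
            onset = i  # threshold crossed upward: a segment starts
        elif value <= threshold and onset is not None:
            segments.append((onset, i))  # dropped back to silence: segment ends
            onset = None
    if onset is not None:
        # the recording ended while a segment was still open
        segments.append((onset, len(amplitude)))
    return segments


# Two syllables separated by a silent gap
clean = [0, 0, 5, 6, 5, 0, 0, 4, 4, 0]
# Same syllables, but background noise fills the gap
noisy = [0, 0, 5, 6, 5, 2, 2, 4, 4, 0]

clean_segments = segment_by_threshold(clean, threshold=1)
noisy_segments = segment_by_threshold(noisy, threshold=1)
```

In the toy `noisy` input the gap never drops below the threshold, so the two syllables merge into one segment. That is the kind of failure that motivates looking for a more robust method.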
Slide 17
Introduction
Segmenting birdsong
◦ Songs can vary between species as much as phonemes and syntax vary between languages
[Figure: song examples. Panels: Zebra finch, Bengalese finch. Scale bars: 1 s]
Slide 19
Introduction
Segmenting birdsong
◦ Songs can vary between species as much as phonemes and syntax vary between languages
◦ Allows us to test how well models generalize across different “languages”
[Figure: song examples. Panels: Zebra finch, Canaries. Scale bars: 1 s]
Slide 20
Introduction
Why do we care about segmenting birdsong?
◦ Use case: behavioral experiments
Slide 22
Introduction
Why do we care about segmenting birdsong?
◦ Use case: relating neural activity to song syllables
Slide 26
Introduction
The problem: segmentation
◦ Supervised learning methods can accurately classify segments*
◦ But they fail when:
  ◦ Segmenting fails due to
    ◦ Noise
    ◦ Changes in song caused by the experiment
  ◦ The bird has a song that is not easily segmented
    ◦ e.g., canary song
The solution? Neural networks, of course
*hybrid-vocal-classifier.readthedocs.io
Slide 27
Methods
But which neural network?
◦ Compare neural networks
◦ Requirements:
  ◦ Find edges of syllables (segment)
  ◦ Label syllables (annotate)
◦ Training data
  ◦ Use spectrograms as input
  ◦ Annotate all time steps
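"Annotate all time steps" can be read as turning hand-labeled segments into one label per spectrogram time bin. The reverse, recovering syllable edges from per-bin labels, covers the "segment" requirement. The sketch below illustrates both directions under assumptions: the function names, the `"silence"` label, and the use of bin centers to decide membership are hypothetical and are not taken from the authors' code.

```python
def segments_to_frame_labels(segments, n_frames, frame_dur, silence_label="silence"):
    """Assign one label to every time bin of a spectrogram.

    segments: list of (onset_s, offset_s, label) tuples, times in seconds.
    n_frames: number of time bins in the spectrogram.
    frame_dur: duration of one time bin in seconds.
    A bin takes a segment's label when the bin center falls inside the segment
    (an assumption made for this sketch).
    """
    labels = [silence_label] * n_frames
    for onset, offset, label in segments:
        for i in range(n_frames):
            center = (i + 0.5) * frame_dur
            if onset <= center < offset:
                labels[i] = label
    return labels


def frame_labels_to_segments(labels, silence_label="silence"):
    """Recover (onset_frame, offset_frame, label) runs from per-bin labels.

    Offsets are exclusive frame indices. Runs of the silence label are dropped.
    """
    segments = []
    start = 0
    for i in range(1, len(labels) + 1):
        # a run ends at the end of the list or where the label changes
        if i == len(labels) or labels[i] != labels[start]:
            if labels[start] != silence_label:
                segments.append((start, i, labels[start]))
            start = i
    return segments


# Toy annotation: two syllables, 10 ms time bins
annotation = [(0.01, 0.03, "a"), (0.05, 0.06, "b")]
frame_labels = segments_to_frame_labels(annotation, n_frames=8, frame_dur=0.01)
recovered = frame_labels_to_segments(frame_labels)
```

With labels defined for every time bin, a network that predicts one label per bin segments and annotates in a single step: syllable edges are the bins where the predicted label changes.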
Slide 30
Methods
Neural network types
◦ Fully convolutional network (FCN)
  ◦ input is a spectrogram
  ◦ convolutional layers + max pooling layers
  ◦ output is the spectrogram with every point labeled
Long, Shelhamer, and Darrell 2015
Koumura and Okanoya 2016
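Max pooling shrinks the spectrogram, so a network that labels every point has to bring its output back to the input size. The toy sketch below shows 2x2 max pooling followed by nearest-neighbor upsampling in plain Python. The upsampling step is an assumption, a stand-in for whatever upsampling the networks in the talk use, and the helper names are hypothetical.

```python
def max_pool_2x2(grid):
    """2x2 max pooling with stride 2 on a 2-D list (even number of rows and columns)."""
    rows, cols = len(grid), len(grid[0])
    return [
        [
            max(grid[r][c], grid[r][c + 1], grid[r + 1][c], grid[r + 1][c + 1])
            for c in range(0, cols, 2)
        ]
        for r in range(0, rows, 2)
    ]


def upsample_2x(grid):
    """Nearest-neighbor upsampling: repeat every value twice along both axes."""
    out = []
    for row in grid:
        wide = [value for value in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))  # copy so the two rows are independent lists
    return out


# Toy 4x4 "spectrogram" (rows: frequency bins, columns: time bins)
spectrogram = [
    [1, 2, 0, 0],
    [3, 4, 0, 1],
    [0, 0, 5, 6],
    [0, 0, 7, 8],
]
pooled = max_pool_2x2(spectrogram)   # 2x2 summary of the input
restored = upsample_2x(pooled)       # back to 4x4, one value per input point
```

The restored grid has one value per input point but has lost fine detail. That loss is why the choice of upsampling matters when the goal is precise syllable edges.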
Methods
To compare:
◦ “Learning curve”: train with training sets of increasing size
  ◦ What is the best we can do with the least amount of data?
  ◦ Replicate each size with a random sample of song files
◦ Measure: accuracy
  ◦ Framewise accuracy
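The comparison procedure above can be sketched in a few lines. Framewise accuracy is taken here to mean the fraction of time bins whose predicted label matches the ground truth. The training sets are drawn by number of files, which is an assumption: the talk reports training set size in seconds. Function names, file names, and the seed are hypothetical.

```python
import random


def framewise_accuracy(true_labels, predicted_labels):
    """Fraction of time bins where the predicted label equals the true label."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label sequences must have the same length")
    if not true_labels:
        raise ValueError("label sequences must not be empty")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


def sample_training_sets(song_files, sizes, n_replicates, seed=0):
    """Draw random subsets of song files for each training set size.

    Returns a dict mapping size -> list of `n_replicates` file lists,
    each sampled without replacement.
    """
    rng = random.Random(seed)  # fixed seed so the draws are reproducible
    return {
        size: [rng.sample(song_files, size) for _ in range(n_replicates)]
        for size in sizes
    }


# Toy example: one wrong bin out of four
accuracy = framewise_accuracy(["a", "a", "b", "silence"], ["a", "b", "b", "silence"])

# Hypothetical file names, three replicates at each of two sizes
files = [f"song_{i:02d}.wav" for i in range(10)]
training_sets = sample_training_sets(files, sizes=[2, 5], n_replicates=3)
```

Training one model per replicate and plotting mean framewise error against training set size gives the learning curve.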
Slide 37
Results
Learning curve
Slide 38
Results
CNN-biLSTM error can improve with some changes in hyperparameters
400-second training set
152 vs. 513 frequency bins
128 vs. 512 hidden units
Bengalese finch test set error histograms
(30 training files, 400 s, ~900 test files)
Slide 39
Results
CNN-biLSTM can achieve almost zero error on a very stereotyped song, as would be expected
Zebra finch test set error histograms
(14 training files, ~20 s, ~150 test files)
Slide 40
Discussion
Initial results suggest CNN-biLSTM outperforms FCN for segmentation
Accuracy of CNN-biLSTM can improve further with changes to hyperparameters
CNN-biLSTM performs nearly perfectly when segmenting syllables from a species with a very stereotyped song
Slide 41
Discussion
Future work
◦ Further study of feedforward vs. recurrent networks
◦ Consider other architectures that have been developed
  ◦ “sliding window” convolutional network
    ◦ Koumura and Okanoya 2016: https://github.com/cycentum/birdsong-recognition
    ◦ similar architecture in Keras: https://github.com/kylerbrown/deepchirp
◦ Applications
  ◦ Automated segmentation of song
  ◦ Speech disorder diagnosis, improved ASR
Slide 42
Acknowledgments
Gardner and Sober labs
◦ http://people.bu.edu/timothyg/Home.html
◦ https://scholarblogs.emory.edu/soberlab/
Funding sources
◦ NIH
◦ NSF
◦ NVIDIA GPU grant program
Fork us on GitHub
https://github.com/yardencsGitHub/tf_syllable_segmentation_annotation
https://github.com/NickleDave/tf_syllable_segmentation_annotation
https://github.com/NickleDave/fcn-syl-seg