Slide 1

Neural networks for segmentation of vocalizations
David Nicholson, Ph.D., Sober lab, Emory University (nicholdav, nickledave)
Yarden Cohen, Ph.D., Gardner lab, Boston University (YardenJCohen, yardencsgithub)

Slide 2

Introduction Automatic Speech Recognition (ASR) / Speech to Text Potter et al. 1947, Visible Speech

Slide 3

Introduction State of the art: LSTM networks https://awni.github.io/speech-recognition/

Slide 4

Introduction State of the art: LSTM networks ◦ Specifically bidirectional LSTMs https://awni.github.io/speech-recognition/

Slide 5

Introduction The problem: segmentation ◦ State-of-the-art networks explicitly avoid segmentation Graves et al. 2005

Slide 6

Introduction The problem: segmentation ◦ State-of-the-art networks explicitly avoid segmentation Graves et al. 2005

Slide 7

Introduction The problem: segmentation ◦ State-of-the-art networks explicitly avoid segmentation Graves et al. 2005

Slide 8

Introduction The problem: segmentation ◦ State-of-the-art networks explicitly avoid segmentation Graves et al. 2005

Slide 9

Introduction
But there are many cases where we care about segmentation
◦ Diagnosis of speech disorders, e.g., stuttering
◦ Understanding how the brain controls speech
  ◦ At the level of muscles
  ◦ At the level of phonemes
◦ As a feature that may make ASR more robust to
  ◦ Background noise
  ◦ Different accents
Büchel C, Sommer M (2004) What Causes Stuttering? PLOS Biology

Slide 10

Introduction
Big picture: segmentation = finding signal in noise
◦ Object recognition
◦ Image segmentation
◦ Finding piano notes in music (Böck and Schedl 2012)
◦ Finding events embedded in background noise (Parascandolo, Huttunen, and Virtanen 2016)
◦ Finding elements of birdsong (Koumura and Okanoya 2016)

Slide 11

Introduction Birdsong: an MNIST for benchmarking segmentation architectures? ◦ Bengalese Finch song repository: https://figshare.com/articles/Bengalese_Finch_song_repository/4805749

Slide 12

Introduction
Songbirds: a model system for understanding how the brain learns speech and similar motor skills
◦ Songbirds learn their vocalizations through social interaction with a tutor
http://songbirdscience.com/
Photo: Jon Sakata. Spectrogram: Dooling lab

Slide 13

Introduction Segmenting birdsong ◦ Songbird “syllables” ◦ like phonemes in human speech

Slide 14

Introduction Segmenting birdsong ◦ Songbird “syllables” ◦ like phonemes in human speech ◦ In many species, song syllables separated by brief silences

Slide 15

Introduction Segmenting birdsong ◦ Songbird “syllables” ◦ like phonemes in human speech ◦ In many species, song syllables separated by brief silences ◦ therefore easier to segment than speech, making it a good test case

Slide 16

Introduction Segmenting birdsong ◦ Songs can vary between species as much as phonemes and syntax vary between languages
[Spectrogram: zebra finch song; 1 s scale bar]

Slide 17

Introduction Segmenting birdsong ◦ Songs can vary between species as much as phonemes and syntax vary between languages
[Spectrograms: zebra finch and Bengalese finch song; 1 s scale bars]

Slide 18

Introduction Segmenting birdsong ◦ Songs can vary between species as much as phonemes and syntax vary between languages
[Spectrograms: zebra finch and canary song; 1 s scale bars]

Slide 19

Introduction Segmenting birdsong ◦ Songs can vary between species as much as phonemes and syntax vary between languages ◦ Allows us to test how well models generalize across different “languages”
[Spectrograms: zebra finch and canary song; 1 s scale bars]

Slide 20

Introduction Why do we care about segmenting birdsong? ◦ Use case: behavioral experiments

Slide 21

Introduction Why do we care about segmenting birdsong? ◦ Use case: behavioral experiments

Slide 22

Introduction Why do we care about segmenting birdsong? ◦ Use case: relating neural activity to song syllables

Slide 23

Introduction Why do we care about segmenting birdsong? ◦ Use case: relating neural activity to song syllables

Slide 24

Introduction Why do we care about segmenting birdsong? ◦ Use case: relating neural activity to song syllables

Slide 25

Introduction
The problem: segmentation
◦ Supervised learning methods can accurately classify segments*
◦ But they fail when segmentation itself fails, due to:
  ◦ Noise
  ◦ Changes in song caused by an experiment
  ◦ A bird whose song is not easily segmented, e.g., canary song
*hybrid-vocal-classifier.readthedocs.io

Slide 26

Introduction
The problem: segmentation
◦ Supervised learning methods can accurately classify segments*
◦ But they fail when segmentation itself fails, due to:
  ◦ Noise
  ◦ Changes in song caused by an experiment
  ◦ A bird whose song is not easily segmented, e.g., canary song
The solution? Neural networks, of course
*hybrid-vocal-classifier.readthedocs.io

Slide 27

Methods
But which neural network?
◦ Compare neural networks
◦ Requirements:
  ◦ Find edges of syllables (segment)
  ◦ Label syllables (annotate)
◦ Training data (see the sketch below):
  ◦ Use spectrograms as input
  ◦ Annotate all time steps
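Below is a minimal Python sketch of what this training data could look like, assuming annotations are given as syllable onset times, offset times, and integer labels; the authors' actual file formats and spectrogram parameters may differ. It produces a log spectrogram as network input plus one label per time bin, with 0 reserved for silence/background.

# A sketch, not the authors' pipeline: build (spectrogram, framewise labels)
# for one song file, given hypothetical onset/offset/label annotations.
import numpy as np
from scipy.io import wavfile
from scipy.signal import spectrogram

def make_input_and_labels(wav_path, onsets_s, offsets_s, syllable_labels,
                          nperseg=512, noverlap=448):
    """Return a log spectrogram and a label for every time bin."""
    rate, audio = wavfile.read(wav_path)
    freqs, times, spect = spectrogram(audio, fs=rate,
                                      nperseg=nperseg, noverlap=noverlap)
    spect = np.log(spect + 1e-10)  # log scaling; parameters are illustrative

    # 0 = silence / background, 1..N = syllable classes
    frame_labels = np.zeros(times.shape, dtype=np.int32)
    for onset, offset, label in zip(onsets_s, offsets_s, syllable_labels):
        frame_labels[(times >= onset) & (times < offset)] = label
    return spect, frame_labels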

Slide 28

Methods Neural network types ◦ Fully convolutional network: FCN ◦ input is spectrogram

Slide 29

Methods Neural network types ◦ Fully convolutional network: FCN ◦ input is spectrogram ◦ convolutional layers + max pooling layers

Slide 30

Methods Neural network types ◦ Fully convolutional network: FCN ◦ input is spectrogram ◦ convolutional layers + max pooling layers ◦ output is spectrogram with every point labeled Long Shelhamer Darrell 2015

Slide 31

Methods Neural network types ◦ Fully convolutional network: FCN ◦ input is spectrogram ◦ convolutional layers + max pooling layers ◦ output is spectrogram with every point labeled Koumura Okanoya 2016
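As a concrete illustration, here is a minimal Keras sketch of the fully convolutional idea described on these slides. It is not the exact architecture of Long, Shelhamer, and Darrell 2015 or of the fcn-syl-seg repository: max pooling is applied along the frequency axis only, so every time bin keeps its own output, and all layer sizes are illustrative.

# A sketch of a fully convolutional network for framewise labeling.
# Layer counts, filter sizes, and pooling factors are illustrative only.
from tensorflow.keras import layers, models

def build_fcn(n_timebins=1000, n_freq_bins=513, n_classes=10):
    spect = layers.Input(shape=(n_timebins, n_freq_bins, 1))  # (time, freq, 1)
    x = layers.Conv2D(32, (5, 5), padding="same", activation="relu")(spect)
    x = layers.MaxPooling2D(pool_size=(1, 8))(x)  # pool along frequency only,
    x = layers.Conv2D(64, (5, 5), padding="same", activation="relu")(x)
    x = layers.MaxPooling2D(pool_size=(1, 8))(x)  # so time resolution is preserved
    x = layers.Reshape((n_timebins, -1))(x)       # (time, remaining freq bins x channels)
    out = layers.Dense(n_classes, activation="softmax")(x)  # one label per time bin
    model = models.Model(spect, out)
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

Pooling only along frequency is one simple way to keep a label for every time bin; an FCN in the style of Long et al. instead downsamples in time and then recovers resolution with upsampling layers.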

Slide 32

Methods Neural network types: ◦ Convolutional + bidirectional LSTM: CNN-biLSTM ◦ Input layer is spectrogram

Slide 33

Methods Neural network types: ◦ Convolutional + bidirectional LSTM: CNN-biLSTM ◦ Input layer is spectrogram ◦ CNN + Max. pooling

Slide 34

Methods Neural network types: ◦ Convolutional + bidirectional LSTM: CNN-biLSTM ◦ Input layer is spectrogram ◦ CNN + Max. pooling ◦ Bi-directional LSTM

Slide 35

Methods Neural network types: ◦ Convolutional + bidirectional LSTM: CNN-biLSTM ◦ Input layer is spectrogram ◦ CNN + Max. pooling ◦ Bi-directional LSTM
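A minimal Keras sketch of the CNN-biLSTM described on these slides: the same kind of convolutional + max pooling front end, followed by a bidirectional LSTM over time and a softmax label for every time bin. Layer and kernel sizes are illustrative, not the values from the authors' TensorFlow implementation.

# A sketch of the CNN-biLSTM: conv front end, bidirectional LSTM over time,
# one label per time bin. Sizes are illustrative, not the authors' values.
from tensorflow.keras import layers, models

def build_cnn_bilstm(n_timebins=1000, n_freq_bins=513, n_classes=10, n_hidden=512):
    spect = layers.Input(shape=(n_timebins, n_freq_bins, 1))   # spectrogram input
    x = layers.Conv2D(32, (5, 5), padding="same", activation="relu")(spect)
    x = layers.MaxPooling2D(pool_size=(1, 8))(x)
    x = layers.Conv2D(64, (5, 5), padding="same", activation="relu")(x)
    x = layers.MaxPooling2D(pool_size=(1, 8))(x)
    x = layers.Reshape((n_timebins, -1))(x)                    # sequence of feature vectors
    x = layers.Bidirectional(layers.LSTM(n_hidden, return_sequences=True))(x)
    out = layers.TimeDistributed(layers.Dense(n_classes, activation="softmax"))(x)
    model = models.Model(spect, out)
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model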

Slide 36

Methods
To compare the networks:
◦ “Learning curve”: train with training sets of increasing size
  ◦ What’s the best we can do with the least amount of data?
  ◦ Replicate each size with random draws of song files
◦ Measure: framewise accuracy (see the sketch below)
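A minimal sketch of this comparison protocol, assuming a caller-supplied train_and_eval function that trains a network on a given subset of song files and returns its framewise accuracy (fraction of time bins labeled correctly) on a fixed test set. The training-set sizes and replicate count here are placeholders, not the values used in the experiments.

# A sketch of the learning-curve protocol; train_and_eval and the sizes
# are placeholders, not part of the authors' code.
import numpy as np

def framewise_accuracy(y_true, y_pred):
    """Fraction of time bins whose predicted label matches the annotation."""
    return np.mean(np.asarray(y_true) == np.asarray(y_pred))

def learning_curve(song_files, train_and_eval,
                   sizes=(60, 120, 240, 480), n_replicates=5, seed=42):
    rng = np.random.default_rng(seed)
    results = {}
    for size in sizes:                        # training sets of increasing size
        accs = []
        for _ in range(n_replicates):         # replicate with random draws of song files
            subset = rng.choice(song_files, size=size, replace=False)
            accs.append(train_and_eval(subset))
        results[size] = (np.mean(accs), np.std(accs))
    return results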

Slide 37

Results Learning curve

Slide 38

Results
CNN-biLSTM error can improve with some changes in hyperparameters
◦ 400-second training set
◦ 152 → 513 frequency bins
◦ 128 → 512 hidden units
[Bengalese finch test set error histograms (30 training files, 400 s, ~900 test files)]

Slide 39

Results
CNN-biLSTM can achieve almost zero error on a very stereotyped song, as would be expected
[Zebra finch test set error histograms (14 training files, ~20 s, ~150 test files)]

Slide 40

Discussion
◦ Initial results suggest CNN-biLSTM outperforms FCN for segmentation
◦ Accuracy of the CNN-biLSTM can be improved further by changing hyperparameters
◦ CNN-biLSTM performs nearly perfectly when segmenting syllables from a species with a very stereotyped song

Slide 41

Discussion
Future work
◦ Further study feedforward vs. recurrent architectures
◦ Consider other previously developed architectures
  ◦ “sliding window” convolutional network
  ◦ Koumura and Okanoya 2016: https://github.com/cycentum/birdsong-recognition
  ◦ similar architecture in Keras: https://github.com/kylerbrown/deepchirp
◦ Applications
  ◦ Automated segmentation of song
  ◦ Speech disorder diagnosis, improved ASR

Slide 42

Acknowledgments
Gardner and Sober labs
◦ http://people.bu.edu/timothyg/Home.html
◦ https://scholarblogs.emory.edu/soberlab/
Funding sources
◦ NIH
◦ NSF
◦ NVIDIA GPU grant program
Fork us on GitHub
◦ https://github.com/yardencsGitHub/tf_syllable_segmentation_annotation
◦ https://github.com/NickleDave/tf_syllable_segmentation_annotation
◦ https://github.com/NickleDave/fcn-syl-seg