
Neural networks for segmentation of vocalizations

PyData NYC 2017 talk

David Nicholson

November 27, 2017


Transcript

  1. Neural networks for segmentation of vocalizations
      David Nicholson, Ph.D., Sober lab, Emory University (nicholdav / nickledave)
      Yarden Cohen, Ph.D., Gardner lab, Boston University (YardenJCohen / yardencsgithub)
  2. Introduction
      State of the art for speech recognition: LSTM networks
      ◦ Specifically, bidirectional LSTMs
      https://awni.github.io/speech-recognition/
  3. Introduction
      But there are many cases where we care about segmentation
      ◦ Diagnosis of speech disorders, e.g., stuttering
      ◦ Understanding how the brain controls speech
        ◦ at the level of muscles
        ◦ at the level of phonemes
      ◦ As a feature that may make ASR more robust to
        ◦ background noise
        ◦ different accents
      Büchel C, Sommer M (2004) What Causes Stuttering? PLOS Biology
  4. Introduction
      Big picture: segmentation = finding signal in noise
      ◦ Object recognition
      ◦ Image segmentation
      ◦ Finding piano keys in music (S. Böck and M. Schedl 2012)
      ◦ Finding events embedded in background noise (Parascandolo, Huttunen, and Virtanen 2016)
      ◦ Finding elements of birdsong (Koumura and Okanoya 2016)
  5. Introduction
      Birdsong: an MNIST for benchmarking segmentation architectures?
      ◦ Bengalese Finch song repository: https://figshare.com/articles/Bengalese_Finch_song_repository/4805749
  6. Introduction
      Songbirds: a model system for understanding how the brain learns speech and similar motor skills
      ◦ learn their vocalizations by social interaction, from a tutor
      http://songbirdscience.com/
      Photo: Jon Sakata. Spectrogram: Dooling lab
  7. Introduction
      Segmenting birdsong
      ◦ Songbird “syllables”
        ◦ like phonemes in human speech
      ◦ In many species, song syllables separated by brief silences
  8. Introduction
      Segmenting birdsong
      ◦ Songbird “syllables”
        ◦ like phonemes in human speech
      ◦ In many species, song syllables separated by brief silences
        → easier to segment than speech, a good test case (see the threshold sketch below)
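To make that point concrete, here is a minimal sketch of amplitude-threshold segmentation, the kind of approach that brief inter-syllable silences make possible. It assumes NumPy/SciPy and a mono waveform; the function name segment_syllables, the filter cutoff, the threshold, and the minimum-silence gap are illustrative choices, not the settings used by hybrid-vocal-classifier or in the talk.

```python
# Sketch: segment syllables by thresholding a smoothed amplitude envelope.
# Parameter values here are illustrative, not the talk's actual settings.
import numpy as np
from scipy.signal import butter, filtfilt

def segment_syllables(song, samp_freq, threshold=0.01, min_silence_s=0.005):
    """Return a list of (onset, offset) times in seconds where the smoothed
    amplitude envelope exceeds `threshold`."""
    song = np.asarray(song, dtype=float)
    # square the signal and low-pass filter it to get a smooth amplitude envelope
    b, a = butter(4, 50.0 / (samp_freq / 2.0), btype="low")
    envelope = filtfilt(b, a, song ** 2)
    above = envelope > threshold
    crossings = np.diff(above.astype(int))
    onsets = (np.nonzero(crossings == 1)[0] + 1) / samp_freq    # False -> True
    offsets = (np.nonzero(crossings == -1)[0] + 1) / samp_freq  # True -> False
    # assumes the recording starts and ends in silence, so onsets/offsets pair up
    segments = []
    for on, off in zip(onsets, offsets):
        if segments and on - segments[-1][1] < min_silence_s:
            # silent gap shorter than min_silence_s: merge with the previous segment
            segments[-1] = (segments[-1][0], off)
        else:
            segments.append((on, off))
    return segments
```

With a birdsong recording sampled at, say, 32 kHz, one might call segment_syllables(song, 32000) and then hand each segment to a classifier; the failure modes listed later (noise, changed song, hard-to-segment species) are exactly where this thresholding breaks down.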
  9. Introduction
      Segmenting birdsong
      ◦ Songs can vary between species as much as phonemes and syntax vary between languages
      [spectrogram: zebra finch song, 1 s scale bar]
  10. Introduction
      Segmenting birdsong
      ◦ Songs can vary between species as much as phonemes and syntax vary between languages
      [spectrograms: zebra finch and Bengalese finch songs, 1 s scale bars]
  11. Introduction
      Segmenting birdsong
      ◦ Songs can vary between species as much as phonemes and syntax vary between languages
      [spectrograms: zebra finch and canary songs, 1 s scale bars]
  12. Introduction
      Segmenting birdsong
      ◦ Songs can vary between species as much as phonemes and syntax vary between languages
      ◦ Allows us to test how well models generalize across different “languages”
      [spectrograms: zebra finch and canary songs, 1 s scale bars]
  13.–15. Introduction
      Why do we care about segmenting birdsong?
      ◦ Use case: relating neural activity to song syllables
  16. Introduction
      The problem: segmentation
      ◦ Supervised learning methods can accurately classify segments*
      ◦ But they fail when segmentation itself fails, due to
        ◦ noise
        ◦ changes in song caused by an experiment
        ◦ a bird whose song is not easily segmented, e.g., canary song
      *hybrid-vocal-classifier.readthedocs.io
  17. Introduction
      The problem: segmentation
      ◦ Supervised learning methods can accurately classify segments*
      ◦ But they fail when segmentation itself fails, due to
        ◦ noise
        ◦ changes in song caused by an experiment
        ◦ a bird whose song is not easily segmented, e.g., canary song
      The solution? Neural networks, of course
      *hybrid-vocal-classifier.readthedocs.io
  18. Methods
      But which neural network?
      ◦ Compare neural networks
      ◦ Requirements:
        ◦ find edges of syllables (segment)
        ◦ label syllables (annotate)
      ◦ Training data:
        ◦ use spectrograms as input
        ◦ annotate all time steps (see the data-preparation sketch below)
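As a rough illustration of "spectrograms as input, annotate all time steps", here is a sketch that turns one annotated song file into a (spectrogram, per-time-bin labels) pair. The STFT parameters and the label scheme (0 = silence, 1..N = syllable classes) are assumptions for illustration, not the exact preprocessing used in the talk's repositories.

```python
# Sketch: build one training example (spectrogram + a label for every time bin)
# from a song waveform and its annotated syllable onsets, offsets, and labels.
import numpy as np
from scipy.signal import spectrogram

def make_training_example(song, samp_freq, onsets_s, offsets_s, labels,
                          nperseg=512, noverlap=480):
    """Return (spect, frame_labels): a log spectrogram with shape
    (freq_bins, time_bins) and one integer label per time bin."""
    freqs, times, spect = spectrogram(song, fs=samp_freq,
                                      nperseg=nperseg, noverlap=noverlap)
    spect = np.log(spect + np.finfo(spect.dtype).eps)    # log scale; eps avoids log(0)
    frame_labels = np.zeros(times.shape, dtype=np.int64)  # 0 = silence / background
    for on, off, lbl in zip(onsets_s, offsets_s, labels):
        # every time bin falling inside an annotated syllable gets that syllable's label
        frame_labels[(times >= on) & (times < off)] = lbl
    return spect, frame_labels
```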
  19. Methods
      Neural network types
      ◦ Fully convolutional network: FCN
        ◦ input is spectrogram
        ◦ convolutional layers + max pooling layers
  20. Methods
      Neural network types
      ◦ Fully convolutional network: FCN
        ◦ input is spectrogram
        ◦ convolutional layers + max pooling layers
        ◦ output is spectrogram with every point labeled
      (Long, Shelhamer, and Darrell 2015)
  21. Methods
      Neural network types
      ◦ Fully convolutional network: FCN
        ◦ input is spectrogram
        ◦ convolutional layers + max pooling layers
        ◦ output is spectrogram with every point labeled (see the FCN sketch below)
      (Koumura and Okanoya 2016)
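A minimal Keras-style sketch of the FCN idea: convolution and max pooling downsample the spectrogram, transposed convolutions upsample it back, and a final 1×1 convolution labels every time-frequency point, in the spirit of Long, Shelhamer, and Darrell 2015. The layer sizes, NUM_CLASSES, and input shape are illustrative assumptions, not the architecture from the fcn-syl-seg repository.

```python
# Sketch of a fully convolutional network (FCN) over spectrograms:
# downsample with conv + max pooling, upsample with transposed convs,
# then label every time-frequency point with a softmax over classes.
from tensorflow import keras
from tensorflow.keras import layers

NUM_CLASSES = 11                   # e.g., 10 syllable classes + silence (assumed)
FREQ_BINS, TIME_BINS = 256, 512    # assumed spectrogram window size

inputs = keras.Input(shape=(FREQ_BINS, TIME_BINS, 1))   # spectrogram as a 1-channel image
x = layers.Conv2D(32, 3, padding="same", activation="relu")(inputs)
x = layers.MaxPooling2D(2)(x)                           # downsample by 2
x = layers.Conv2D(64, 3, padding="same", activation="relu")(x)
x = layers.MaxPooling2D(2)(x)                           # downsample by 4 in total
x = layers.Conv2D(64, 3, padding="same", activation="relu")(x)
x = layers.Conv2DTranspose(32, 3, strides=2, padding="same", activation="relu")(x)
x = layers.Conv2DTranspose(32, 3, strides=2, padding="same", activation="relu")(x)
outputs = layers.Conv2D(NUM_CLASSES, 1, activation="softmax")(x)  # label every point

fcn = keras.Model(inputs, outputs)
fcn.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
            metrics=["accuracy"])
```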
  22. Methods
      Neural network types:
      ◦ Convolutional + bidirectional LSTM: CNN-biLSTM
        ◦ Input layer is spectrogram
        ◦ CNN + max pooling
  23.–24. Methods
      Neural network types:
      ◦ Convolutional + bidirectional LSTM: CNN-biLSTM
        ◦ Input layer is spectrogram
        ◦ CNN + max pooling
        ◦ Bi-directional LSTM (see the CNN-biLSTM sketch below)
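A minimal Keras-style sketch of the CNN-biLSTM idea: convolution plus max pooling (here pooling only along the frequency axis, so the time axis keeps its length) feed a bidirectional LSTM that outputs a label for every time step. Layer sizes, NUM_CLASSES, and the input shape are illustrative assumptions, not the exact architecture in tf_syllable_segmentation_annotation.

```python
# Sketch of a CNN-biLSTM for framewise labeling of spectrograms:
# conv + max pooling extract local features, a bidirectional LSTM labels every time step.
from tensorflow import keras
from tensorflow.keras import layers

NUM_CLASSES = 11
FREQ_BINS, TIME_BINS = 256, 512

inputs = keras.Input(shape=(FREQ_BINS, TIME_BINS, 1))    # spectrogram, 1 channel
x = layers.Conv2D(32, 3, padding="same", activation="relu")(inputs)
x = layers.MaxPooling2D(pool_size=(2, 1))(x)             # pool only along frequency,
x = layers.Conv2D(64, 3, padding="same", activation="relu")(x)
x = layers.MaxPooling2D(pool_size=(2, 1))(x)             # so the time axis keeps its length
# reshape to (time_steps, features): one feature vector per spectrogram time bin
x = layers.Permute((2, 1, 3))(x)                          # -> (time, freq, channels)
x = layers.Reshape((TIME_BINS, (FREQ_BINS // 4) * 64))(x)
x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(x)
outputs = layers.TimeDistributed(layers.Dense(NUM_CLASSES, activation="softmax"))(x)

cnn_bilstm = keras.Model(inputs, outputs)
cnn_bilstm.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                   metrics=["accuracy"])
```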
  25. Methods
      To compare the networks:
      ◦ “Learning curve”: train with training sets of increasing size
        ◦ What’s the best we can do with the least amount of data?
        ◦ Replicate each training-set size with a random grab of song files
      ◦ Measure: frame-wise accuracy (see the sketch below)
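A sketch of that comparison loop and of frame-wise accuracy (the fraction of spectrogram time bins labeled correctly). The callables train_fn and predict_fn stand in for training and inference with either the FCN or the CNN-biLSTM; they, and the default number of replicates, are hypothetical placeholders rather than the talk's actual code.

```python
# Sketch of the "learning curve": for each training-set size, repeatedly draw a
# random subset of song files, train a network, and measure frame-wise accuracy
# on a held-out test set.
import numpy as np

def framewise_accuracy(y_true, y_pred):
    """Fraction of spectrogram time bins whose predicted label matches ground truth."""
    y_true, y_pred = np.concatenate(y_true), np.concatenate(y_pred)
    return np.mean(y_true == y_pred)

def learning_curve(train_files, test_files, test_labels, train_set_sizes,
                   train_fn, predict_fn, n_replicates=5, seed=0):
    """train_fn(list_of_files) -> model; predict_fn(model, file) -> 1-D array of
    frame labels. Both are hypothetical callables supplied by the caller."""
    rng = np.random.default_rng(seed)
    results = {}   # training-set size -> list of accuracies, one per replicate
    for size in train_set_sizes:
        accs = []
        for _ in range(n_replicates):
            # random grab of song files for this replicate
            subset = rng.choice(train_files, size=size, replace=False)
            model = train_fn(list(subset))
            y_pred = [predict_fn(model, f) for f in test_files]
            accs.append(framewise_accuracy(test_labels, y_pred))
        results[size] = accs
    return results
```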
  26. Results
      CNN-biLSTM error can improve with some changes in hyperparameters
      ◦ 400-second training set
      ◦ 152 → 513 frequency bins
      ◦ 128 → 512 hidden units
      [figure: Bengalese finch test set error histograms (30 training files, 400 s, ~900 test files)]
  27. Results
      CNN-biLSTM can achieve almost zero error on a very stereotyped song, as would be expected
      [figure: zebra finch test set error histograms (14 training files, ~20 s, ~150 test files)]
  28. Discussion
      ◦ Initial results suggest CNN-biLSTM outperforms FCN for segmentation
      ◦ Accuracy of CNN-biLSTM can improve further by changing hyperparameters
      ◦ CNN-biLSTM performs nearly perfectly when segmenting syllables from a species with a very stereotyped song
  29. Discussion
      Future work
      ◦ Further study feedforward vs. recurrent architectures
      ◦ Consider other architectures already developed
        ◦ “sliding window” convolutional network
          ◦ Koumura and Okanoya 2016: https://github.com/cycentum/birdsong-recognition
          ◦ similar architecture in Keras: https://github.com/kylerbrown/deepchirp
      ◦ Applications
        ◦ automated segmentation of song
        ◦ speech disorder diagnosis, improved ASR
  30. Acknowledgments
      Gardner and Sober labs
      ◦ http://people.bu.edu/timothyg/Home.html
      ◦ https://scholarblogs.emory.edu/soberlab/
      Funding sources
      ◦ NIH
      ◦ NSF
      ◦ NVIDIA GPU grant program
      Fork us on GitHub
      ◦ https://github.com/yardencsGitHub/tf_syllable_segmentation_annotation
      ◦ https://github.com/NickleDave/tf_syllable_segmentation_annotation
      ◦ https://github.com/NickleDave/fcn-syl-seg