Neural networks for segmentation of vocalizations

PyDataNYC 2017 talk

David Nicholson

November 27, 2017

Transcript

  1. Neural networks for segmentation of vocalizations
    David Nicholson, Ph.D., Sober lab, Emory University (nicholdav, nickledave)
    Yarden Cohen, Ph.D., Gardner lab, Boston University (YardenJCohen, yardencsgithub)
  2. Introduction. Automatic Speech Recognition (ASR) / Speech to Text
    Potter et al. 1947, Visible Speech
  3. Introduction. State of the art: LSTM networks
    https://awni.github.io/speech-recognition/
  4. Introduction. State of the art: LSTM networks
    ◦ Specifically bidirectional LSTMs
    https://awni.github.io/speech-recognition/
  5. Introduction. The problem: segmentation
    ◦ State-of-the-art networks explicitly avoid segmentation
    Graves et al. 2005
  9. Introduction. But there are many cases where we care about segmentation
    ◦ Diagnosis of speech disorders, e.g., stuttering
    ◦ Understand how the brain controls speech
      ◦ At the level of muscles
      ◦ At the level of phonemes
    ◦ As a feature that may make ASR more robust
      ◦ Background noise
      ◦ Different accents
    Büchel C, Sommer M (2004) What Causes Stuttering? PLOS Biology
  10. Introduction. Big picture: segmentation = finding signal in noise
    ◦ Object recognition
    ◦ Image segmentation
    ◦ Finding piano keys in music (S. Böck and M. Schedl 2012)
    ◦ Finding events embedded in background noise (Parascandolo, Huttunen, and Virtanen 2016)
    ◦ Finding elements of birdsong (Koumura and Okanoya 2016)
  11. Introduction. Birdsong: an MNIST for benchmarking segmentation architectures?
    ◦ Bengalese Finch song repository: https://figshare.com/articles/Bengalese_Finch_song_repository/4805749
  12. Introduction. Songbirds: a model system for understanding how the brain learns speech and similar motor skills
    ◦ Learn their vocalizations by social interaction, from a tutor
    http://songbirdscience.com/
    Photo: Jon Sakata. Spectrogram: Dooling lab
  13. Introduction. Segmenting birdsong
    ◦ Songbird “syllables”
      ◦ like phonemes in human speech
  14. Introduction. Segmenting birdsong
    ◦ Songbird “syllables”
      ◦ like phonemes in human speech
    ◦ In many species, song syllables are separated by brief silences
  15. Introduction. Segmenting birdsong
    ◦ Songbird “syllables”
      ◦ like phonemes in human speech
    ◦ In many species, song syllables are separated by brief silences
      ◦ Hence easier to segment than speech, a good test case
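The brief silences between syllables are what make simple segmentation possible. The slides do not say which segmenting method was used, so the following is only a minimal sketch of one common approach, assuming a precomputed amplitude envelope and a hand-picked threshold; the function name, the `min_silence` parameter, and the toy data are all hypothetical.

```python
def segment_by_threshold(amplitude, threshold, min_silence=2):
    """Return (onset, offset) index pairs where amplitude exceeds threshold.

    Offsets are exclusive. Silent gaps shorter than min_silence samples are
    bridged, so a brief dip does not split one syllable in two.
    This is an illustrative assumption, not the method from the talk.
    """
    segments = []
    onset = None
    for i, value in enumerate(amplitude):
        if value > threshold and onset is None:
            onset = i                      # crossed above threshold
        elif value <= threshold and onset is not None:
            segments.append((onset, i))    # dropped back to silence
            onset = None
    if onset is not None:                  # sound continues to the end
        segments.append((onset, len(amplitude)))

    merged = []
    for seg in segments:
        if merged and seg[0] - merged[-1][1] < min_silence:
            merged[-1] = (merged[-1][0], seg[1])   # bridge the short gap
        else:
            merged.append(seg)
    return merged


# Toy envelope: two syllables, the second with a one-sample dip.
envelope = [0, 0, 5, 6, 5, 0, 0, 0, 7, 8, 0, 6, 7, 0, 0]
print(segment_by_threshold(envelope, threshold=1))   # [(2, 5), (8, 13)]

# Raising the noise floor above the threshold hides every silence.
noisy = [value + 2 for value in envelope]
print(segment_by_threshold(noisy, threshold=1))      # [(0, 15)]
```

The second call shows how a method like this can break down: once background noise exceeds the threshold, the silences disappear and the whole recording becomes one segment.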
  16. Introduction. Segmenting birdsong
    ◦ Songs can vary between species as much as phonemes and syntax vary between languages
    [Spectrogram: zebra finch song, 1 s scale bar]
  17. Introduction. Segmenting birdsong
    ◦ Songs can vary between species as much as phonemes and syntax vary between languages
    [Spectrograms: zebra finch and Bengalese finch song, 1 s scale bars]
  18. Introduction. Segmenting birdsong
    ◦ Songs can vary between species as much as phonemes and syntax vary between languages
    [Spectrograms: zebra finch and canary song, 1 s scale bars]
  19. Introduction. Segmenting birdsong
    ◦ Songs can vary between species as much as phonemes and syntax vary between languages
    ◦ Allows us to test how well models generalize across different “languages”
    [Spectrograms: zebra finch and canary song, 1 s scale bars]
  20. Introduction. Why do we care about segmenting birdsong?
    ◦ Use case: behavioral experiments
  22. Introduction. Why do we care about segmenting birdsong?
    ◦ Use case: relating neural activity to song syllables
  25. Introduction. The problem: segmentation
    ◦ Supervised learning methods can accurately classify segments*
    ◦ But fail when:
      ◦ Segmenting fails due to
        ◦ Noise
        ◦ Change in song because of experiment
      ◦ Bird has a song that is not easily segmented
        ◦ Canary song
    *hybrid-vocal-classifier.readthedocs.io
  26. Introduction. The problem: segmentation
    ◦ Supervised learning methods can accurately classify segments*
    ◦ But fail when:
      ◦ Segmenting fails due to
        ◦ Noise
        ◦ Change in song because of experiment
      ◦ Bird has a song that is not easily segmented
        ◦ Canary song
    The solution? Neural networks, of course
    *hybrid-vocal-classifier.readthedocs.io
  27. Methods. But which neural network?
    ◦ Compare neural networks
    ◦ Requirements:
      ◦ Find edges of syllables (segment)
      ◦ Label syllables (annotate)
    ◦ Training data
      ◦ Use spectrograms as input
      ◦ Annotate all time steps
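"Annotate all time steps" can be read as turning each hand-labeled syllable into one label per spectrogram time bin. A minimal sketch of that conversion, assuming annotations are stored as onset times, offset times, and labels; the function name, the `"-"` silence label, and the 10 ms bin spacing are hypothetical choices, not taken from the talk.

```python
def segments_to_frame_labels(onsets, offsets, labels, time_bins, silence="-"):
    """Give each spectrogram time bin the label of the syllable covering it.

    onsets and offsets are in seconds; time_bins holds the center time of
    each spectrogram column. Bins outside every syllable get the silence
    label. All names here are illustrative assumptions.
    """
    frame_labels = [silence] * len(time_bins)
    for onset, offset, label in zip(onsets, offsets, labels):
        for i, t in enumerate(time_bins):
            if onset <= t < offset:
                frame_labels[i] = label
    return frame_labels


# Ten time bins, 10 ms apart, centered at 5 ms, 15 ms, ...
time_bins = [0.005 + 0.01 * i for i in range(10)]

# Two annotated syllables: "a" from 20 to 40 ms, "b" from 60 to 90 ms.
frame_labels = segments_to_frame_labels(
    onsets=[0.02, 0.06],
    offsets=[0.04, 0.09],
    labels=["a", "b"],
    time_bins=time_bins,
)
print("".join(frame_labels))   # --aa--bbb-
```

A label vector like this has the same length as the spectrogram's time axis, so it can serve as the per-time-step training target.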
  28. Methods. Neural network types
    ◦ Fully convolutional network: FCN
      ◦ input is spectrogram
  29. Methods. Neural network types
    ◦ Fully convolutional network: FCN
      ◦ input is spectrogram
      ◦ convolutional layers + max pooling layers
  30. Methods. Neural network types
    ◦ Fully convolutional network: FCN
      ◦ input is spectrogram
      ◦ convolutional layers + max pooling layers
      ◦ output is spectrogram with every point labeled
    Long, Shelhamer, and Darrell 2015
  31. Methods. Neural network types
    ◦ Fully convolutional network: FCN
      ◦ input is spectrogram
      ◦ convolutional layers + max pooling layers
      ◦ output is spectrogram with every point labeled
    Koumura and Okanoya 2016
  32. Methods. Neural network types
    ◦ Convolutional + bidirectional LSTM: CNN-biLSTM
      ◦ Input layer is spectrogram
  33. Methods. Neural network types
    ◦ Convolutional + bidirectional LSTM: CNN-biLSTM
      ◦ Input layer is spectrogram
      ◦ CNN + max pooling
  34. Methods. Neural network types
    ◦ Convolutional + bidirectional LSTM: CNN-biLSTM
      ◦ Input layer is spectrogram
      ◦ CNN + max pooling
      ◦ Bidirectional LSTM
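If a network emits one label per time bin, the syllable edges still have to be read off from that sequence. The slides do not show this step, so the following is a hedged sketch of one straightforward way to do it: collapse runs of identical labels into segments. The function name and the `"-"` silence label are assumptions made for illustration.

```python
from itertools import groupby


def frame_labels_to_segments(frame_labels, silence="-"):
    """Collapse per-time-bin labels into (label, onset_bin, offset_bin).

    Each run of identical non-silence labels becomes one segment; the
    offset bin is exclusive. Illustrative only, not the authors' code.
    """
    segments = []
    start = 0
    for label, run in groupby(frame_labels):
        length = len(list(run))
        if label != silence:
            segments.append((label, start, start + length))
        start += length
    return segments


# Hypothetical network output for 14 time bins.
predicted = list("--aaa--bbbb-cc")
print(frame_labels_to_segments(predicted))
# [('a', 2, 5), ('b', 7, 11), ('c', 12, 14)]
```

Multiplying the bin indices by the spectrogram's time-bin duration would give onsets and offsets in seconds. A single mislabeled bin inside a syllable would split it in two here, so in practice some smoothing of the label sequence may be needed first.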
  36. Methods. To compare:
    ◦ “Learning curve”: train with training sets of increasing size
      ◦ What’s the best we can do with the least amount of data?
      ◦ Replicate for each size with a random sample of song files
    ◦ Measures: accuracy
      ◦ Framewise accuracy
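The comparison procedure above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' pipeline: training set size is counted in number of files (the talk also reports sizes in seconds of song), and the function names, file names, and seed are hypothetical.

```python
import random


def framewise_error(true_labels, predicted_labels):
    """Fraction of time bins whose predicted label differs from the truth.

    Framewise accuracy is 1 minus this value.
    """
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label sequences must have the same length")
    if not true_labels:
        raise ValueError("label sequences must not be empty")
    wrong = sum(t != p for t, p in zip(true_labels, predicted_labels))
    return wrong / len(true_labels)


def learning_curve_subsets(song_files, sizes, n_replicates, seed=0):
    """Draw n_replicates random training subsets for each training set size.

    Returns a dict mapping each size to a list of file lists.
    """
    rng = random.Random(seed)
    return {
        size: [rng.sample(song_files, size) for _ in range(n_replicates)]
        for size in sizes
    }


# One of five time bins is mislabeled.
print(framewise_error(list("--aab"), list("--abb")))   # 0.2

# Hypothetical file names; three replicates each at two training set sizes.
files = [f"song{i}.wav" for i in range(10)]
subsets = learning_curve_subsets(files, sizes=[2, 4], n_replicates=3)
for size, replicates in subsets.items():
    print(size, [len(replicate) for replicate in replicates])
```

Each replicate would then be used to train a fresh model, and the framewise error on a fixed test set plotted against training set size.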
  37. Results. Learning curve
  38. Results. CNN-biLSTM error can improve with some changes in hyperparameters
    400 second training set
    152 → 513 frequency bins
    128 → 512 hidden units
    Bengalese finch test set error histograms (30 training files, 400 s, ~900 test files)
  39. Results. CNN-biLSTM can achieve almost zero error on a very stereotyped song, as would be expected
    Zebra finch test set error histograms (14 training files, ~20 s, ~150 test files)
  40. Discussion
    ◦ Initial results suggest CNN-biLSTM outperforms FCN for segmentation
    ◦ Accuracy of CNN-biLSTM can improve further by changing hyperparameters
    ◦ CNN-biLSTM performs nearly perfectly when segmenting syllables from a species with a very stereotyped song
  41. Discussion. Future work
    ◦ Further study feedforward vs. recurrent
    ◦ Consider other architectures developed
      ◦ “sliding window” convolutional network
        ◦ Koumura and Okanoya 2016: https://github.com/cycentum/birdsong-recognition
        ◦ similar architecture in Keras: https://github.com/kylerbrown/deepchirp
    ◦ Applications
      ◦ Automated segmentation of song
      ◦ Speech disorder diagnosis, improved ASR
  42. Acknowledgments
    Gardner and Sober labs
    ◦ http://people.bu.edu/timothyg/Home.html
    ◦ https://scholarblogs.emory.edu/soberlab/
    Funding sources
    ◦ NIH
    ◦ NSF
    ◦ NVIDIA GPU grant program
    Fork us on GitHub
    ◦ https://github.com/yardencsGitHub/tf_syllable_segmentation_annotation
    ◦ https://github.com/NickleDave/tf_syllable_segmentation_annotation
    ◦ https://github.com/NickleDave/fcn-syl-seg