
Mini Project Seminar

A presentation about my mini project at SASTRA.

Sharath Sriram

October 11, 2019

Transcript

  1. 3D Convolutional Neural Networks for Cross Audio-Visual Matching Recognition

    Name: Sharath Sriram | Register number: 120003287 | Project Guide: Muthaiah R, Associate Dean, ICT
  2. Base Paper Information

    Title: 3D Convolutional Neural Networks for Cross Audio-Visual Matching Recognition. Source: https://doi.org/10.1109/ACCESS.2017.2761539. Authors: Amirsina Torfi, Seyed Mehdi Iranmanesh, N. M. Nasrabadi, J. M. Dawson
  3. Outline

    • The target of this work is the accurate matching of audio and video streams during an Audio-Visual Recognition (AVR) process. • A coupled 3D Convolutional Neural Network analyses spatial and temporal data to match and correlate the two streams. • The network maps the two modalities into a single representation space and evaluates their correspondence accurately, even when trained on a relatively small dataset.
  4. Problem Statement

    One of the main challenges of Audio-Visual Recognition (AVR), which is used as a mechanism to recognise speech from the lip movements in a video, is matching the corresponding audio and video streams perfectly. An algorithm that analyses spatial and temporal data and accurately finds the correlation between the two streams is desired.
  5. Introduction

    • Audio-Visual Recognition (AVR) has been considered a solution for speech recognition tasks when the audio is corrupted, as well as a visual recognition method for speaker verification in multi-speaker scenarios. • However, the audio and video streams must be mapped to each other accurately to achieve the desired results. • A coupled 3D Convolutional Neural Network architecture can map both spatial and temporal modalities into a shared representation space and evaluate the correspondence of audio-visual streams using multimodal features (a minimal sketch follows below).
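
The coupled architecture can be pictured as two small 3D-CNN towers with separate weights whose outputs live in the same embedding space. The sketch below illustrates that idea, assuming PyTorch; the layer sizes, embedding dimension, and input cube shapes are illustrative assumptions, not the exact architecture from the base paper.

import torch
import torch.nn as nn


class Tower3D(nn.Module):
    """One branch: a small stack of 3D convolutions followed by a fully
    connected layer mapping the input cube to a fixed-size embedding."""

    def __init__(self, in_channels, embed_dim=64):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(in_channels, 16, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d(1),  # collapse the spatio-temporal dimensions
        )
        self.fc = nn.Linear(32, embed_dim)

    def forward(self, x):
        return self.fc(self.features(x).flatten(1))


class CoupledAVNet(nn.Module):
    """Two towers with separate weights; both map into the same embedding
    space, so audio-visual correspondence can be scored by the distance
    between the two output vectors."""

    def __init__(self, embed_dim=64):
        super().__init__()
        self.audio_tower = Tower3D(in_channels=3, embed_dim=embed_dim)   # speech feature cube
        self.visual_tower = Tower3D(in_channels=1, embed_dim=embed_dim)  # mouth-crop cube

    def forward(self, audio_cube, visual_cube):
        return self.audio_tower(audio_cube), self.visual_tower(visual_cube)


if __name__ == "__main__":
    net = CoupledAVNet()
    audio = torch.randn(2, 3, 15, 40, 1)   # (batch, channels, time, freq, 1) - illustrative
    video = torch.randn(2, 1, 9, 60, 100)  # (batch, channels, frames, H, W) - illustrative
    a_emb, v_emb = net(audio, video)
    print(a_emb.shape, v_emb.shape)        # both torch.Size([2, 64])
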
  6. Modules Involved

    • SpeechNet is responsible for mapping the audio stream and extracting the features of the audio file. • The Mouth Extraction module processes the input video file and locates the regions of each frame portraying the speaker's mouth; those regions are extracted separately for further processing (a sketch of this step follows below). • The Lip Tracking module maps the extracted visual streams to the audio features to accurately pair the audio and video.
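
One plausible way to realise the Mouth Extraction step is dlib's frontal face detector together with its 68-point facial landmark model (points 48-67 outline the lips) and OpenCV for reading frames. This is a minimal sketch under those assumptions; the predictor file name refers to dlib's standard pre-trained model, downloaded separately, and the video file name is hypothetical.

import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")


def mouth_crops(video_path, size=(100, 60)):
    """Yield one resized grayscale mouth crop per frame containing a face."""
    capture = cv2.VideoCapture(video_path)
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = detector(gray)
        if len(faces) == 0:
            continue
        landmarks = predictor(gray, faces[0])
        # In the 68-point scheme, landmarks 48-67 outline the lips.
        points = np.array([(landmarks.part(i).x, landmarks.part(i).y)
                           for i in range(48, 68)], dtype=np.int32)
        x, y, w, h = cv2.boundingRect(points)
        crop = gray[y:y + h, x:x + w]
        yield cv2.resize(crop, size)
    capture.release()


# Example: stack the per-frame crops into the visual input cube.
# cube = np.stack(list(mouth_crops("speaker.mp4")))  # shape (frames, 60, 100)
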
  7. Procedure

    • In the audio section, the speech features are extracted from the audio files using the SpeechPy package. Mel-Frequency Energy Coefficients (MFECs) are used to represent the speech features. • In the visual section, the videos are post-processed to a uniform frame rate of 30 frames per second. Mouth-area extraction is performed on the videos using the dlib library, and the extracted areas are resized to the same dimensions and concatenated to form the input feature cube. • The two networks are coupled at the last layer, and a contrastive loss is used as a discriminative distance metric to optimise the coupling process (both feature extraction and the loss are sketched below).
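
Two of the steps above can be sketched concretely: extracting MFEC-style speech features with SpeechPy, and the contrastive loss that couples the two networks. PyTorch for the loss, the WAV file name, and the exact SpeechPy keyword arguments are assumptions on my part, not details given in the slides.

import speechpy
import torch
import torch.nn.functional as F
from scipy.io import wavfile

# Speech features: 40 log mel-filterbank energies per 20 ms frame, 10 ms stride.
fs, signal = wavfile.read("utterance.wav")  # hypothetical input file
mfec = speechpy.feature.lmfe(signal, sampling_frequency=fs,
                             frame_length=0.020, frame_stride=0.010,
                             num_filters=40)  # shape: (num_frames, 40)


def contrastive_loss(audio_emb, visual_emb, match, margin=1.0):
    """Contrastive loss coupling the two towers: `match` is 1 for genuine
    (corresponding) audio-visual pairs and 0 for impostor pairs. Genuine
    pairs are pulled together; impostor pairs are pushed apart until
    their distance exceeds `margin`."""
    distance = F.pairwise_distance(audio_emb, visual_emb)
    genuine = match * distance.pow(2)
    impostor = (1.0 - match) * torch.clamp(margin - distance, min=0.0).pow(2)
    return 0.5 * (genuine + impostor).mean()
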
  8. Conclusions

    • A novel coupled 3D convolutional architecture for matching audio-visual streams is presented. • The proposed architecture outperforms existing audio-visual matching methods. • Joint learning of spatial and temporal information using 3D convolution is highly effective. • The extraction of local audio features is shown to be promising for audio-visual recognition using convolutional neural networks.