of audio and video streams accurately during an Audio-Visual Recognition (AVR) process. • A coupled 3D Convolutional Neural Network analyses spatial and temporal data jointly to match and correlate the two streams. • Even when trained on a smaller dataset, the algorithm accurately maps the two modalities into a single representation space and evaluates the correspondence of their information.
Recognition (AVR), a mechanism for detecting speech from the lip movements in a video, requires that corresponding audio and video streams be matched precisely. An algorithm that analyses spatial and temporal data to accurately find the correlation between the two streams is therefore desired.
speech recognition tasks when the audio is corrupted, as well as a visual recognition method for speaker verification in multi-speaker scenarios. • To achieve the desired results, however, the audio and video streams must be mapped accurately. • A coupled 3D Convolutional Neural Network architecture can map both spatial and temporal modalities into a shared representation space and evaluate the correspondence of audio-visual streams using multimodal features.
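The key property of a 3D convolution is that a single kernel spans both time and space, so spatial and temporal structure are learned jointly rather than frame by frame. The following is a minimal numpy sketch of that operation (a naive single-channel "valid" 3D convolution, not the paper's actual network); the cube and kernel sizes are illustrative.

```python
import numpy as np

def conv3d_valid(cube, kernel):
    """Naive single-channel 3D convolution with 'valid' padding.

    cube:   (T, H, W) spatio-temporal input, e.g. stacked mouth crops
    kernel: (t, h, w) filter spanning time and space jointly
    """
    T, H, W = cube.shape
    t, h, w = kernel.shape
    out = np.empty((T - t + 1, H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for k in range(out.shape[2]):
                # Each output value pools a small time-space neighbourhood.
                out[i, j, k] = np.sum(cube[i:i + t, j:j + h, k:k + w] * kernel)
    return out

# A 9-frame stack of 8x8 crops convolved with a 3x3x3 averaging kernel
# yields a (7, 6, 6) feature map: the kernel slides along time as well.
cube = np.random.rand(9, 8, 8)
feat = conv3d_valid(cube, np.ones((3, 3, 3)) / 27.0)
print(feat.shape)  # (7, 6, 6)
```

Because the kernel extends across frames, motion cues such as lip movement contribute to every activation, which is what distinguishes this from applying 2D convolutions per frame.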
stream and extracts the features of the audio file. • The Mouth Extraction module processes the input video file and locates the regions of each frame portraying the speaker's mouth. These areas are extracted separately for further processing. • The Lip Tracking module maps the extracted visual streams to the audio features to determine and pair the corresponding audio and video accurately.
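After mouth regions are extracted, the crops generally differ in size from frame to frame, so they must be resized to a common shape before being stacked into the network's input. The sketch below illustrates that step with a simple nearest-neighbour resize in numpy; the crop sizes and the 60x60 target are illustrative assumptions (the real pipeline uses dlib for detection and a proper image-resize routine).

```python
import numpy as np

def resize_nn(img, size):
    """Nearest-neighbour resize of a 2-D grayscale crop to (size, size)."""
    h, w = img.shape
    rows = np.arange(size) * h // size  # source row for each target row
    cols = np.arange(size) * w // size  # source column for each target column
    return img[np.ix_(rows, cols)]

def build_cube(crops, size=60):
    """Resize each mouth crop and stack them along time into a (T, size, size) cube."""
    return np.stack([resize_nn(c, size) for c in crops])

# Nine mouth crops of slightly different (ragged) sizes, as detection yields.
crops = [np.random.rand(40 + i, 70) for i in range(9)]
cube = build_cube(crops)
print(cube.shape)  # (9, 60, 60)
```

The resulting cube is exactly the kind of spatio-temporal input a 3D convolutional layer consumes.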
extracted from the audio files using the SpeechPy package, with MFEC (Mel Frequency Energy Coefficients) representing the speech features. • In the visual section, the videos are post-processed to a uniform frame rate of 30 f/s. The mouth area is extracted from each frame using the dlib library, and the extracted crops are resized to a common size and concatenated to form the input feature cube. • The two networks are coupled at the last layer, and a contrastive loss is used as a discriminative distance metric to optimise the coupling.
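The coupling objective above can be sketched as follows: a contrastive loss pulls matching audio-visual embedding pairs together and pushes non-matching pairs apart beyond a margin. This is a small numpy illustration assuming Euclidean distance between the coupled embeddings; the margin value is a hypothetical default, not taken from the paper.

```python
import numpy as np

def contrastive_loss(a, v, y, margin=1.0):
    """Contrastive loss between audio embeddings a and visual embeddings v.

    a, v: (N, D) batches of embeddings from the two coupled networks
    y:    (N,) labels, 1 for genuine (matching) pairs, 0 for impostor pairs
    Genuine pairs are penalised by their squared distance; impostor pairs
    are penalised only while they sit inside the margin.
    """
    d = np.linalg.norm(a - v, axis=1)  # Euclidean distance per pair
    loss = y * d**2 + (1 - y) * np.maximum(margin - d, 0.0)**2
    return loss.mean()

# Genuine pairs with identical embeddings incur zero loss.
a = np.zeros((2, 16))
print(contrastive_loss(a, a, np.ones(2)))  # 0.0
```

Minimising this objective over genuine and impostor pairs is what trains the two streams to agree in the shared representation space.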
stream networks is presented. • The proposed architecture outperforms existing audio-visual matching methods. • Joint learning of spatial and temporal information using 3D convolution is highly effective. • The extraction of local audio features is shown to be promising for audio-visual recognition using convolutional neural networks.
modeling in speech recognition: The shared views of four research groups," IEEE Signal Process. Mag., vol. 29, pp. 82-97, Nov. 2012. [2] Q. V. Le et al., "Building high-level features using large scale unsupervised learning," Proc. IEEE Int. Conf. Acoust. Speech Signal Process., pp. 8595-8598, May 2013.