3D Convolutional Neural Networks for Audio-Visual Recognition


Check relevant links: [Paper, GitHub, Project Page]

This repository contains the implementation of our paper: 3D Convolutional Neural Networks for Cross Audio-Visual Matching Recognition.


The input pipeline must be prepared by the user. This code provides the implementation of coupled 3D Convolutional Neural Networks for audio-visual matching; lip reading is one specific application of this work.

General View

Audio-visual recognition (AVR) has been considered a solution for speech recognition tasks when the audio is corrupted, as well as a visual recognition method for speaker verification in multi-speaker scenarios. AVR systems leverage the information extracted from one modality to improve the recognition ability of the other modality by complementing the missing information.

The Problem and the Approach

The essential problem is to find the correspondence between the audio and visual streams, which is the goal of this work. We propose a coupled 3D Convolutional Neural Network (CNN) architecture that maps both modalities into a representation space in which the correspondence of audio-visual streams is evaluated using the learned multimodal features.

How to leverage 3D Convolutional Neural Networks?

The proposed architecture incorporates spatial and temporal information jointly to effectively find the correlation between the temporal information of the different modalities. Using a relatively small network architecture and a much smaller dataset, our proposed method surpasses the performance of existing similar methods for audio-visual matching that use CNNs for feature representation. We also demonstrate that an effective pair selection method can significantly increase performance.
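As a point of reference for how audio-visual correspondence can be scored over learned embeddings, below is a minimal sketch of a standard contrastive loss over paired audio and visual features. The function name, margin value, and labeling convention are illustrative assumptions, not the exact formulation from the paper; an effective pair selection strategy then amounts to choosing which genuine and impostor pairs this loss is computed over.

```python
import tensorflow as tf

def contrastive_loss(audio_emb, visual_emb, labels, margin=1.0):
    """Standard contrastive loss over paired embeddings.

    labels: 1 for a genuine (matching) audio-visual pair, 0 for an impostor pair.
    The margin value is an illustrative choice, not the paper's setting.
    """
    # Euclidean distance between the two embeddings of each pair.
    distances = tf.norm(audio_emb - visual_emb, axis=1)
    labels = tf.cast(labels, distances.dtype)
    # Genuine pairs are pulled together; impostor pairs are pushed apart
    # until they are at least `margin` away.
    positive_term = labels * tf.square(distances)
    negative_term = (1.0 - labels) * tf.square(tf.maximum(margin - distances, 0.0))
    return tf.reduce_mean(positive_term + negative_term)
```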

Processing

In the visual section, the videos are post-processed to have an equal frame rate of 30 f/s. Then, face tracking and mouth-area extraction are performed on the videos using the dlib library [dlib]. Finally, all mouth areas are resized to the same size and concatenated to form the input feature cube. The dataset does not contain any audio files; the audio is extracted from the videos using the FFmpeg framework [ffmpeg]. The processing pipeline is shown in the figure below.

readme_images/processing.gif
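As a rough illustration of this pipeline, the sketch below extracts the audio track with FFmpeg and crops the mouth region using dlib's 68-point facial landmark model (mouth landmarks 48-67). The predictor path, crop size, and sampling rate are illustrative assumptions, not necessarily the exact settings used in the paper.

```python
import subprocess
import cv2
import dlib
import numpy as np

# Illustrative paths/sizes; the values used in the actual pipeline may differ.
PREDICTOR_PATH = "shape_predictor_68_face_landmarks.dat"
MOUTH_SIZE = (100, 60)  # (width, height) of the resized mouth crop

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor(PREDICTOR_PATH)

def extract_audio(video_path, wav_path):
    # Extract the audio track with FFmpeg as 16 kHz mono PCM (assumed settings).
    subprocess.run(["ffmpeg", "-y", "-i", video_path, "-ac", "1", "-ar", "16000", wav_path],
                   check=True)

def extract_mouth_frames(video_path):
    """Detect the face in every frame and crop the mouth region (landmarks 48-67)."""
    cap = cv2.VideoCapture(video_path)
    mouths = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = detector(gray, 1)
        if not faces:
            continue
        landmarks = predictor(gray, faces[0])
        pts = np.array([[landmarks.part(i).x, landmarks.part(i).y] for i in range(48, 68)],
                       dtype=np.int32)
        x, y, w, h = cv2.boundingRect(pts)
        mouth = cv2.resize(gray[y:y + h, x:x + w], MOUTH_SIZE)
        mouths.append(mouth)
    cap.release()
    return np.stack(mouths)  # shape: (num_frames, 60, 100)
```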

Input Pipeline for this work

The proposed architecture utilizes two non-identical ConvNets that take a pair of speech and video streams. The network input is a pair of features representing lip movement and speech extracted from 0.3 seconds of a video clip. The main task is to determine whether a stream of audio corresponds to a lip-motion clip within the desired stream duration. The next two subsections explain the inputs for the speech and visual streams.

Speech Net

On the time axis, the temporal features are non-overlapping 20 ms windows used to generate spectrum features with a local characteristic. The input speech feature map, represented as an image cube, corresponds to the spectrogram as well as the first- and second-order derivatives of the MFEC features; these three channels correspond to the image depth. From a 0.3-second clip, 15 temporal feature sets (each with 40 MFEC features) are derived, which together form a speech feature cube. Each input feature map for a single audio stream therefore has a dimensionality of $15 \times 40 \times 3$. This representation is depicted in the following figure:

readme_images/Speech_GIF.gif

The speech features have been extracted using the SpeechPy package.
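For reference, here is a minimal sketch of how such a $15 \times 40 \times 3$ speech feature cube could be produced with SpeechPy; the audio is assumed to be mono, and the exact parameters are assumptions rather than the paper's settings.

```python
import numpy as np
import speechpy
from scipy.io import wavfile

def speech_feature_cube(wav_path):
    """Build a 15 x 40 x 3 feature cube (MFEC + first/second derivatives)
    from a 0.3-second audio clip, following the description above."""
    sampling_rate, signal = wavfile.read(wav_path)
    # 40 log-energy filterbank (MFEC) features per non-overlapping 20 ms window.
    mfec = speechpy.feature.lmfe(signal, sampling_frequency=sampling_rate,
                                 frame_length=0.020, frame_stride=0.020,
                                 num_filters=40)
    # Append first- and second-order derivatives -> shape (num_frames, 40, 3).
    cube = speechpy.feature.extract_derivative_feature(mfec)
    # A 0.3-second clip yields 15 non-overlapping 20 ms frames.
    return cube[:15]
```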

Visual Net

The frame rate of each video clip used in this work is 30 f/s. Consequently, 9 successive image frames form the 0.3-second visual stream. The input of the visual stream of the network is a cube of size $9 \times 60 \times 100$, where 9 is the number of frames carrying the temporal information and each channel is a $60 \times 100$ gray-scale image of the mouth region.

readme_images/lip_motion.jpg
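For completeness, a minimal sketch of assembling the 0.3-second visual input cube from the mouth crops is shown below; the helper name and pixel normalization are assumptions for illustration only.

```python
import numpy as np

def visual_cube(mouth_frames, start_frame=0):
    """Stack 9 successive 60x100 gray-scale mouth crops (0.3 s at 30 f/s)
    into a 9 x 60 x 100 input cube for the visual stream.

    `mouth_frames` is assumed to be the (num_frames, 60, 100) array produced
    by a mouth-extraction step such as the sketch in the Processing section.
    """
    clip = mouth_frames[start_frame:start_frame + 9]
    assert clip.shape == (9, 60, 100), "expected 9 frames of 60x100 mouth crops"
    # Scale pixel values to [0, 1]; the exact normalization is an assumption.
    return clip.astype(np.float32) / 255.0
```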

Architecture

The architecture is a coupled 3D convolutional neural network in which two different networks with different sets of weights must be trained. For the visual network, the spatial information of the lip motions and the temporal information are incorporated jointly and fused to exploit the temporal correlation. For the audio network, the extracted energy features are treated as the spatial dimension, and the stacked audio frames form the temporal dimension. In the proposed 3D CNN architecture, the convolutional operations are performed on successive temporal frames for both audio and visual streams.

readme_images/DNN-Coupled.png
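To make the coupled design concrete, the following minimal tf.keras sketch builds two small 3D-convolutional towers with separate weights that map the $15 \times 40 \times 3$ speech cube and the $9 \times 60 \times 100$ visual cube into a shared embedding space. The layer counts, filter sizes, pooling, and embedding dimension are illustrative assumptions and not the configuration reported in the paper.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

EMBEDDING_DIM = 128  # illustrative choice, not the paper's setting

def conv3d_tower(input_shape, pool_size, name):
    """A small 3D-conv tower mapping an input cube to an L2-normalized embedding."""
    inputs = layers.Input(shape=input_shape, name=f"{name}_input")
    x = layers.Conv3D(16, kernel_size=3, padding="same", activation="relu")(inputs)
    x = layers.MaxPooling3D(pool_size=pool_size)(x)
    x = layers.Conv3D(32, kernel_size=3, padding="same", activation="relu")(x)
    x = layers.GlobalAveragePooling3D()(x)
    x = layers.Dense(EMBEDDING_DIM)(x)
    outputs = layers.UnitNormalization()(x)
    return Model(inputs, outputs, name=name)

# Speech cube: 15 temporal frames x 40 MFEC features x 3 derivative channels.
speech_net = conv3d_tower(input_shape=(15, 40, 3, 1), pool_size=(1, 2, 1), name="speech_net")
# Visual cube: 9 frames of 60 x 100 gray-scale mouth crops.
visual_net = conv3d_tower(input_shape=(9, 60, 100, 1), pool_size=(1, 2, 2), name="visual_net")

# The two towers are trained with separate weights; their embeddings are compared
# with a pair-matching objective such as the contrastive loss sketched earlier.
```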