Introduction

Audio and visual cues are the modalities humans most commonly use to identify other people and to sense their emotional state. Features extracted from these two modalities are often highly correlated, which lets us imagine the visual appearance of a person just by listening to their voice, or form expectations about the tone or pitch of someone's voice from a picture.

We present Speech2YouTuber, a method that aims to imagine an image of a face that could correspond to a given speech utterance. Our solution builds on recent advances in deep generative models, namely Variational Auto-Encoders (VAE) and Generative Adversarial Networks (GAN). Speech2YouTuber is inspired by previous works that condition image generation on text or audio features; in this work, we condition the generative process on raw speech.

If you find this work useful, please consider citing us:

Download our paper in PDF here.

Dataset

This section describes the data collection pipeline, from video downloading to all of the preprocessing steps applied to the audio signals and the video frames.

1/ YouTubers Collection: We collected the last 15 videos uploaded to the channels of 62 different Spanish speakers (29 male and 33 female) of different ethnicities and accents.

2/ Audio preprocessing: The original audio, in AAC format at 44100 Hz stereo, was converted to WAV, re-sampled to 16 kHz with 16 bits per sample, and converted to mono.

3/ Face Detection: A Haar feature-based cascade classifier (pre-trained on frontal faces) was used. For each detected face we store the bounding box coordinates, the cropped face image in BGR format, the full frame and a 4-second speech frame spanning 2 seconds before and after the given video frame. We also keep the identity (name) of each sample, so that we can distinguish between speakers.

4/ Audio overlapping: Whenever possible, that is, whenever faces were detected in consecutive frames, an overlap of 2 seconds between consecutive speech frames was applied.

5/ Image preprocessing: Before any further processing, all images were normalized and resized to 64x64.

6/ Speech frames preprocessing: Each speech frame was also normalized to [-1, 1]. In addition, a pre-emphasis step was applied to increase the amplitude of the higher frequency bands while decreasing that of the lower ones, since higher frequencies are more important for signal disambiguation. A minimal sketch of steps 2, 3, 5 and 6 follows this list.
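
As an illustration, here is a minimal Python sketch of the per-sample preprocessing described in steps 2, 3, 5 and 6. It assumes OpenCV's bundled haarcascade_frontalface_default.xml classifier and librosa for audio loading; the detection parameters, the pre-emphasis coefficient (0.97) and the image normalization range are assumptions for illustration, not values taken from our paper.

```python
import cv2
import librosa
import numpy as np

PREEMPH = 0.97  # assumed pre-emphasis coefficient (not specified above)

# Pre-trained frontal-face Haar cascade shipped with OpenCV
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def preprocess_audio(wav_path):
    """Load audio, resample to 16 kHz mono, normalize to [-1, 1], pre-emphasize."""
    speech, _ = librosa.load(wav_path, sr=16000, mono=True)
    speech = speech / (np.max(np.abs(speech)) + 1e-8)                  # [-1, 1]
    speech = np.append(speech[0], speech[1:] - PREEMPH * speech[:-1])  # pre-emphasis
    return speech

def preprocess_frame(frame_bgr):
    """Detect the first frontal face, crop it, resize to 64x64 and normalize."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = faces[0]
    face = cv2.resize(frame_bgr[y:y + h, x:x + w], (64, 64))
    return face.astype(np.float32) / 127.5 - 1.0                       # [-1, 1]
```

The 4-second speech frame around each detected face (step 3) corresponds to 64000 samples at 16 kHz, and consecutive frames share a 2-second overlap (step 4) whenever faces are detected in consecutive video frames.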


In the end, we obtained the following dataset:

| Sex | Speakers | Faces | Speech (s) |
| --- | --- | --- | --- |
| Male | 29 | 26299 | 105196 |
| Female | 33 | 15900 | 63600 |
| Total | 62 | 42199 | 168796 |

Since each face sample is paired with a 4-second speech frame, the speech duration equals 4 times the number of faces.

Model



Diagram of the speech-to-image synthesis method. Orange blocks stand for the audio embedding vector of size 128, while pink blocks represent convolutional/deconvolutional blocks.
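
The exact layer configuration is not reproduced here; as an illustrative sketch only, a DCGAN-style generator conditioned on the 128-dimensional audio embedding and producing 64x64 images could look like the following PyTorch module (the noise size, channel widths and number of blocks are assumptions, not the actual architecture).

```python
import torch
import torch.nn as nn

class SpeechConditionedGenerator(nn.Module):
    """Illustrative generator: 128-d audio embedding (+ noise) -> 64x64 RGB face."""

    def __init__(self, embed_dim=128, noise_dim=100, base_channels=64):
        super().__init__()
        self.net = nn.Sequential(
            # project the concatenated [noise, audio embedding] vector to a 4x4 map
            nn.ConvTranspose2d(noise_dim + embed_dim, base_channels * 8, 4, 1, 0),
            nn.BatchNorm2d(base_channels * 8), nn.ReLU(True),
            nn.ConvTranspose2d(base_channels * 8, base_channels * 4, 4, 2, 1),  # 8x8
            nn.BatchNorm2d(base_channels * 4), nn.ReLU(True),
            nn.ConvTranspose2d(base_channels * 4, base_channels * 2, 4, 2, 1),  # 16x16
            nn.BatchNorm2d(base_channels * 2), nn.ReLU(True),
            nn.ConvTranspose2d(base_channels * 2, base_channels, 4, 2, 1),      # 32x32
            nn.BatchNorm2d(base_channels), nn.ReLU(True),
            nn.ConvTranspose2d(base_channels, 3, 4, 2, 1),                      # 64x64
            nn.Tanh(),  # outputs in [-1, 1]
        )

    def forward(self, noise, audio_embedding):
        # concatenate noise and the audio embedding, reshape to a 1x1 feature map
        z = torch.cat([noise, audio_embedding], dim=1).unsqueeze(-1).unsqueeze(-1)
        return self.net(z)
```

Given noise of shape (batch, 100) and an audio embedding of shape (batch, 128), the module returns a (batch, 3, 64, 64) tensor in [-1, 1], matching the normalization applied to the training images.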

Results

Here is an example of results when training on only two male speakers: Generated Images

Here are the general results we obtained:

| Experiment | Score | Std |
| --- | --- | --- |
| LS-GAN, projecting input | 2.91 | 0.11 |
| LS-GAN, concat. input | 2.05 | 0.02 |
| LS-GAN, concat. input, TTUR | 3.00 | 0.07 |
| LS-GAN, concat. input, TTUR + dropouts | 2.63 | 0.09 |
| AC-GAN, projecting one-hot vector | 2.12 | 0.04 |




Here are the better results we obtained after cleaning the dataset:

| Experiment | Score | Std |
| --- | --- | --- |
| Full Dataset | 2.14 | 0.13 |
| Clean Dataset | 2.21 | 0.41 |

Code

See our code on GitHub here.

Slides
Acknowledgements

We want to thank our technical support team:

   
We gratefully acknowledge the support of NVIDIA Corporation with the donation of the GeForce GTX Titan Z and Titan X used in this work.
The Image Processing Group at the UPC is a SGR14 Consolidated Research Group recognized and sponsored by the Catalan Government (Generalitat de Catalunya) through its AGAUR office.
This work has been developed in the framework of projects TEC2013-43935-R and TEC2016-75976-R, financed by the Spanish Ministerio de Economía y Competitividad and the European Regional Development Fund (ERDF).