Abstract

We present a deep learning method for singing voice conversion. The proposed network is conditioned neither on the text nor on the notes; it directly converts the audio of one singer to the voice of another. Training is performed without any form of supervision: no lyrics or phonetic features of any kind, no notes, and no matching samples between singers. The proposed network employs a single CNN encoder for all singers, a single WaveNet decoder, and a classifier that enforces a singer-agnostic latent representation. Each singer is represented by one embedding vector, on which the decoder is conditioned. To deal with relatively small datasets, we propose a new data augmentation scheme, as well as new training losses and protocols based on backtranslation. Our evaluation presents evidence that the conversion produces natural singing voices that are highly recognizable as the target singer.
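To make the architecture concrete, the following is a minimal PyTorch sketch of the three components. It is illustrative only: the dilated convolution stacks stand in for the autoregressive WaveNet decoder, and all names and dimensions (`latent_dim`, `emb_dim`, `n_singers`) are assumptions rather than the paper's hyperparameters.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Single CNN encoder shared by all singers: waveform -> latent sequence."""
    def __init__(self, latent_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(128, 128, kernel_size=3, padding=2, dilation=2), nn.ReLU(),
            nn.Conv1d(128, latent_dim, kernel_size=1),
        )

    def forward(self, wav):            # wav: (batch, 1, time)
        return self.net(wav)           # z:   (batch, latent_dim, time)

class Decoder(nn.Module):
    """Single decoder conditioned on a per-singer embedding vector.
    (A dilated conv stack stands in for the real autoregressive WaveNet.)"""
    def __init__(self, latent_dim=64, emb_dim=32, n_singers=5):
        super().__init__()
        self.singer_emb = nn.Embedding(n_singers, emb_dim)
        self.net = nn.Sequential(
            nn.Conv1d(latent_dim + emb_dim, 128, 3, padding=2, dilation=2), nn.ReLU(),
            nn.Conv1d(128, 128, 3, padding=4, dilation=4), nn.ReLU(),
            nn.Conv1d(128, 1, kernel_size=1),
        )

    def forward(self, z, singer_id):   # singer_id: (batch,) long tensor
        e = self.singer_emb(singer_id)                  # (batch, emb_dim)
        e = e.unsqueeze(-1).expand(-1, -1, z.size(-1))  # broadcast over time
        return self.net(torch.cat([z, e], dim=1))

class SingerClassifier(nn.Module):
    """Adversary that predicts the singer from the latent; the encoder is
    trained against it so the latent becomes singer-agnostic."""
    def __init__(self, latent_dim=64, n_singers=5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(latent_dim, 128, kernel_size=1), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
            nn.Linear(128, n_singers),
        )

    def forward(self, z):
        return self.net(z)             # logits: (batch, n_singers)
```

In this reading, the classifier minimizes cross-entropy on detached latents while the encoder receives the negated classifier loss, so singer identity is carried only by the embedding that conditions the decoder.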
Conclusions
Our unsupervised method is shown to produce high-quality audio that is recognizable as the target voice. The method is based on a single CNN encoder and a single conditional WaveNet decoder. A confusion network encourages the latent space to be singer-agnostic, and new types of augmentation are applied to overcome the limited amount of available data. Crucially, training is done in two phases, the second of which employs a backtranslation method based on mixup identities.
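As a rough illustration of the mixup-backtranslation idea, the sketch below builds a virtual singer by convexly mixing two embedding vectors, renders a sample in that virtual voice, and then trains the network to map it back to the original singer. It reuses the hypothetical `Encoder`/`Decoder` modules from the sketch above; the L1 waveform loss and the fixed mixing coefficient `alpha` are simplifying assumptions, since the paper's decoder is trained as an autoregressive WaveNet.

```python
import torch
import torch.nn.functional as F

def backtranslation_step(enc, dec, wav, singer_id, other_id, alpha=0.5):
    """One mixup-backtranslation step (illustrative assumptions throughout).

    wav:       (batch, 1, time) samples of singer `singer_id`
    other_id:  (batch,) a second singer used to form the virtual identity
    """
    with torch.no_grad():
        # virtual identity: convex mix of two singer embeddings
        e_mix = (alpha * dec.singer_emb(singer_id)
                 + (1 - alpha) * dec.singer_emb(other_id))
        z = enc(wav)
        e = e_mix.unsqueeze(-1).expand(-1, -1, z.size(-1))
        wav_virtual = dec.net(torch.cat([z, e], dim=1))  # sample in virtual voice
    # backtranslate: convert the virtual sample back to the original singer
    recon = dec(enc(wav_virtual), singer_id)
    return F.l1_loss(recon, wav)       # paired target exists by construction
```

The design point is that the virtual identity gives, for free, a pseudo-parallel pair: since the content came from singer `singer_id`, converting the synthetic sample back to that singer has a known ground-truth target.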
As future work, we would like to investigate whether a similar method can perform the conversion in the presence of background music. We believe that this can be done in an unsupervised way, without relying on a supervised voice-separation technique for preprocessing.