Unsupervised Representation Disentanglement using Cross Domain Features and Adversarial Learning in Variational Autoencoder based Voice Conversion

An effective approach for voice conversion (VC) is to disentangle linguistic content from other components in the speech signal. The effectiveness of variational autoencoder (VAE) based VC (VAE-VC), for instance, strongly relies on this principle. In our prior work, we proposed a cross-domain VAE-VC (CDVAE-VC) framework, which utilized acoustic features of different properties, to improve the performance of VAE-VC. We believed that the success came from more disentangled latent representations. In this paper, we extend the CDVAE-VC framework by incorporating the concept of adversarial learning...

MoEVC: A Mixture-of-experts Voice Conversion System with Sparse Gating Mechanism for Accelerating Online Computation

With the recent advancements of deep learning technologies, the performance of voice conversion (VC) in terms of quality and similarity has been significantly improved. However, heavy computations are generally required for deep-learning-based VC systems, which can cause notable latency and thus confine their deployments in real-world applications. Therefore, increasing online computation efficiency has become an important task. In this study, we propose a novel mixture-of-experts (MoE) based VC system. The MoE model uses a gating mechanism to specify optimal weights to feature maps to increas...

Generalization of Spectrum Differential based Direct Waveform Modification for Voice Conversion

We present a modification to the spectrum differential based direct waveform modification for voice conversion (DIFFVC) so that it can be directly applied as a waveform generation module to voice conversion models. The recently proposed DIFFVC avoids the use of a vocoder, meanwhile preserves rich spectral details hence capable of generating high quality converted voice. To apply the DIFFVC framework, a model that can estimate the spectral differential from the F0 transformed input speech needs to be trained beforehand. This requirement imposes several constraints, including a limitation on the...

Voice Conversion Based on Cross-Domain Features Using Variational Auto Encoders

An effective approach to non-parallel voice conversion (VC) is to utilize deep neural networks (DNNs), specifically variational auto encoders (VAEs), to model the latent structure of speech in an unsupervised manner. A previous study has confirmed the effectiveness of VAE using the STRAIGHT spectra for VC. However, VAE using other types of spectral features such as melcepstral coefficients (MCCs), which are related to human perception and have been widely used in VC, have not been properly investigated. Instead of using one specific type of spectral feature, it is expected that VAE may benefit...