Although voice conversion (VC) algorithms have achieved remarkable success with advances in machine learning, superior performance is still difficult to achieve when using nonparallel data. In this paper, we propose using a cycle-consistent adversarial network (CycleGAN) for VC training on nonparallel data. A CycleGAN is a generative adversarial network (GAN) originally developed for unpaired image-to-image translation. A subjective evaluation of inter-gender conversion demonstrated that the proposed method significantly outperformed both a method based on the Merlin open-source neural network speech synthesis system (a parallel VC system adapted for our setup) and a GAN-based parallel VC system. This is the first research to show that the performance of a nonparallel VC method can exceed that of state-of-the-art parallel VC methods.
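To illustrate why a CycleGAN can be trained without paired utterances, the following is a minimal sketch of the cycle-consistency term from the general CycleGAN formulation. It is not the training code of the proposed method: the toy linear "generators", the feature dimension, and all names (cycle_consistency_loss, g_xy, g_yx_exact, g_yx_poor) are assumptions made for illustration, and the adversarial terms are omitted.

```python
import numpy as np


def cycle_consistency_loss(x, y, g_xy, g_yx):
    """L1 cycle loss for unpaired batches x (source) and y (target).

    x and y may contain different numbers of frames, because each batch
    is only ever compared with its own reconstruction.
    """
    x_cycled = g_yx(g_xy(x))  # source -> target -> source
    y_cycled = g_xy(g_yx(y))  # target -> source -> target
    return np.mean(np.abs(x_cycled - x)) + np.mean(np.abs(y_cycled - y))


rng = np.random.default_rng(0)
dim = 4  # hypothetical acoustic feature dimension

# Hypothetical toy generators: fixed linear maps standing in for networks.
W = rng.normal(size=(dim, dim)) + 3.0 * np.eye(dim)
W_inv = np.linalg.inv(W)


def g_xy(frames):
    return frames @ W


def g_yx_exact(frames):
    return frames @ W_inv  # exact inverse of g_xy


def g_yx_poor(frames):
    return frames  # identity map, not an inverse of g_xy


# Unpaired data: 10 source frames and 7 target frames, no alignment.
x = rng.normal(size=(10, dim))
y = rng.normal(size=(7, dim))

loss_exact = cycle_consistency_loss(x, y, g_xy, g_yx_exact)
loss_poor = cycle_consistency_loss(x, y, g_xy, g_yx_poor)
print(f"cycle loss with inverse mapping:     {loss_exact:.3e}")
print(f"cycle loss with non-inverse mapping: {loss_poor:.3e}")
```

In this sketch the loss is close to zero only when the two mappings invert each other, which is the constraint that substitutes for frame-aligned parallel data. In an actual CycleGAN the term would be combined with adversarial losses from discriminators on each speaker's features.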
Conclusion and Future Work
We have developed a high-quality nonparallel VC method based on a CycleGAN. We compared the proposed method with two state-of-the-art parallel VC methods, one based on a Merlin system and the other based on a GAN. In an inter-gender conversion experiment, the proposed nonparallel method performed significantly better in terms of speech quality and speaker similarity than the two parallel methods.
Future work includes developing a method for strictly constraining the linguistic information to be invariant in the CycleGAN. We also plan to further improve the speech quality and speaker similarity and to compare our method with others using the dataset of the Voice Conversion Challenge.