This paper proposes Scyclone, a high-quality voice conversion (VC) technique that does not require parallel training data. Scyclone improves the naturalness and speaker similarity of the converted speech by combining CycleGAN-based spectrogram conversion with a simplified WaveRNN-based vocoder. In Scyclone, a linear spectrogram is used as the conversion feature instead of vocoder parameters, which avoids the quality degradation caused by extraction errors in the fundamental frequency and voiced/unvoiced parameters. The spectrograms of the source and target speakers are modeled by modified CycleGAN networks, and the waveform is reconstructed using the simplified WaveRNN with a single Gaussian probability density function. Subjective experiments with completely unpaired training data show that Scyclone is significantly better than CycleGAN-VC2, one of the existing state-of-the-art parallel-data-free VC techniques.
Some of the speech samples used in the MOS tests are available at the following URL: https://bit.ly/2NFvLhk
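The idea of training spectrogram conversion on unpaired data can be illustrated with the cycle-consistency term of the standard CycleGAN formulation. The following is a minimal sketch, not the Scyclone implementation: the L1 distance, the function names, and the toy generators g_xy and g_yx are assumptions made for illustration, and the adversarial terms are omitted.

```python
def l1_distance(spec_a, spec_b):
    """Mean absolute difference between two spectrograms (lists of frames)."""
    total = 0.0
    count = 0
    for frame_a, frame_b in zip(spec_a, spec_b):
        for value_a, value_b in zip(frame_a, frame_b):
            total += abs(value_a - value_b)
            count += 1
    return total / count


def cycle_consistency_loss(spec_x, spec_y, g_xy, g_yx):
    """Penalize spectrograms that do not survive a round trip X -> Y -> X and Y -> X -> Y.

    g_xy and g_yx are hypothetical generator callables (source-to-target and
    target-to-source); no paired utterances are needed to evaluate this term.
    """
    recon_x = g_yx(g_xy(spec_x))
    recon_y = g_xy(g_yx(spec_y))
    return l1_distance(spec_x, recon_x) + l1_distance(spec_y, recon_y)


def scale(factor):
    """Toy stand-in for a generator: multiplies every spectrogram bin by a factor."""
    return lambda spec: [[factor * value for value in frame] for frame in spec]


# Unpaired toy spectrograms: different lengths, unrelated content.
source_spec = [[1.0, 2.0], [0.5, 4.0]]
target_spec = [[2.0, 8.0], [1.0, 0.5], [4.0, 2.0]]

loss_inverse = cycle_consistency_loss(source_spec, target_spec, scale(2.0), scale(0.5))
loss_mismatch = cycle_consistency_loss(source_spec, target_spec, scale(2.0), scale(1.0))
```

When the two toy generators are exact inverses the loss vanishes, and it grows when the round trip distorts the spectrogram, which is the property that lets the two mappings be learned without parallel utterances.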
Conclusions
This paper has proposed Scyclone, a parallel-data-free VC technique that uses CycleGAN-based spectrogram conversion and a simplified WaveRNN-based neural vocoder with a Gaussian loss. To improve the modeling and conversion performance of CycleGAN, the network was modified to use a non-encoder-decoder architecture with spectral normalization. Experiments were conducted with completely unpaired training data. The subjective evaluation results showed that Scyclone is superior to CycleGAN-VC2, a state-of-the-art parallel-data-free VC technique. A more detailed description and evaluation of Scyclone will be presented in our next article.
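As an illustration of what a Gaussian loss for a vocoder output can look like, the sketch below computes the negative log-likelihood of waveform samples under a single Gaussian whose mean and log standard deviation are predicted per sample. This is a generic formulation under stated assumptions, not the authors' code: the log-standard-deviation parameterization and all function names are hypothetical.

```python
import math
import random


def gaussian_nll(sample, mean, log_std):
    """Negative log-likelihood of one waveform sample under N(mean, exp(log_std)^2)."""
    std = math.exp(log_std)
    z = (sample - mean) / std
    return 0.5 * math.log(2.0 * math.pi) + log_std + 0.5 * z * z


def mean_gaussian_nll(samples, means, log_stds):
    """Average the per-sample loss over a sequence of waveform samples."""
    if not (len(samples) == len(means) == len(log_stds)):
        raise ValueError("samples, means and log_stds must have the same length")
    total = sum(gaussian_nll(s, m, l) for s, m, l in zip(samples, means, log_stds))
    return total / len(samples)


def draw_sample(mean, log_std, rng=random):
    """At synthesis time, draw the next waveform sample from the predicted Gaussian."""
    return rng.gauss(mean, math.exp(log_std))


# Toy values standing in for network outputs (assumed, for illustration only).
targets = [0.0, 0.1, -0.2]
pred_means = [0.0, 0.1, -0.2]
pred_log_stds = [0.0, 0.0, 0.0]
loss = mean_gaussian_nll(targets, pred_means, pred_log_stds)
```

Predicting the log of the standard deviation keeps the variance positive without an explicit constraint, which is one common design choice for this kind of output layer.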