A Vocoder-free WaveNet Voice Conversion with Non-Parallel Data

In a typical voice conversion system, a vocoder is commonly used for speech-to-features analysis and features-to-speech synthesis. However, the vocoder can be a source of speech quality degradation. This paper presents a vocoder-free voice conversion approach using WaveNet for non-parallel training data. Instead of working with intermediate vocoder features, the proposed approach uses WaveNet to map Phonetic PosteriorGrams (PPGs) directly to waveform samples. In this way, we avoid the estimation errors caused by the vocoder and by feature conversion. Additionally, as PPGs are assumed to be speaker independent, the proposed method also reduces the feature mismatch problem in WaveNet-vocoder-based approaches. Experiments conducted on the CMU-ARCTIC database show that the proposed approach significantly outperforms the baseline approaches in terms of speech quality.
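As a rough illustration of what mapping PPGs directly to waveform samples involves on the data side, the sketch below pairs each waveform sample with a conditioning vector. It is not the configuration used in this paper, which the text here does not specify. It assumes 8-bit mu-law quantization of the waveform (as in the original WaveNet) and simple frame repetition to bring frame-level PPGs up to the sample rate; the function names and the `hop_length` parameter are hypothetical.

```python
import math


def mu_law_encode(x, mu=255):
    """Map a sample in [-1, 1] to an integer class in [0, mu] (assumed 8-bit mu-law)."""
    x = max(-1.0, min(1.0, x))  # clip to the valid range
    compressed = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)
    return int((compressed + 1.0) / 2.0 * mu + 0.5)


def upsample_ppg(ppg_frames, hop_length):
    """Repeat each frame-level PPG vector hop_length times (assumed upsampling scheme),
    so that every waveform sample has its own conditioning vector."""
    return [frame for frame in ppg_frames for _ in range(hop_length)]


def make_training_pairs(waveform, ppg_frames, hop_length):
    """Build (PPG conditioning vector, quantized sample) pairs for a sample-level model."""
    cond = upsample_ppg(ppg_frames, hop_length)
    n = min(len(waveform), len(cond))  # trim to the shorter of the two sequences
    targets = [mu_law_encode(s) for s in waveform[:n]]
    return list(zip(cond[:n], targets))


if __name__ == "__main__":
    # Two toy PPG frames over a two-class phonetic inventory, three samples per frame.
    frames = [[0.9, 0.1], [0.2, 0.8]]
    samples = [0.0, 0.1, -0.1, 0.5, -0.5, 1.0]
    for cond, target in make_training_pairs(samples, frames, hop_length=3):
        print(cond, target)
```

Because the conditioning input carries only phonetic posteriors, no vocoder parameters appear anywhere in these pairs; in a scheme like this, speaker characteristics would have to come from training the sample-level model on the target speaker's recordings.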

Conclusions

This paper presents a vocoder-free voice conversion approach using WaveNet for non-parallel data. The proposed approach does not rely on vocoder features for conversion, which reduces the feature mismatch problem in WaveNet-vocoder-based approaches. Experimental results show that the proposed WaveNet-VC significantly outperforms the baseline methods in terms of speech quality while maintaining speaker identity.
