In this paper, we investigate the effectiveness of a quasi-periodic WaveNet (QPNet) vocoder combined with a statistical spectral conversion technique for a voice conversion task. The WaveNet (WN) vocoder has been applied as the waveform generation module in many voice conversion frameworks and achieves significant improvements over conventional vocoders. However, because of its fixed dilated convolutions and generic network architecture, the WN vocoder lacks robustness against unseen input features and often requires a very large network to achieve acceptable speech quality. These limitations usually degrade performance in the voice conversion task. To overcome these problems, we apply the QPNet vocoder, which includes a pitch-dependent dilated convolution component to enhance pitch controllability and attain a more compact network than the WN vocoder. In the proposed method, input spectral features are first converted using a framewise deep neural network, and the QPNet vocoder then generates converted speech conditioned on the linearly converted prosodic features and the converted spectral features. The experimental results confirm that the QPNet vocoder achieves significantly better performance than the same-size WN vocoder while maintaining speech quality comparable to that of the double-size WN vocoder.
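As a rough illustration of the pitch-dependent dilated convolution mentioned above, the sketch below derives a per-sample dilation size from an F0 contour. It assumes that the dilation is scaled by the pitch period in samples divided by a constant; the function name `pitch_dependent_dilation`, the parameter `dense_factor`, the default sampling rate, and the fallback to the fixed dilation for unvoiced samples are all assumptions made for this example and are not taken from the text.

```python
import numpy as np


def pitch_dependent_dilation(f0_hz, sample_rate=22050, dense_factor=4, base_dilation=1):
    """Return one dilation size (in samples) per F0 value.

    Assumption for this sketch: the scale is round(sample_rate / (f0 * dense_factor)),
    so that roughly `dense_factor` taps fall within one pitch period.
    """
    f0 = np.asarray(f0_hz, dtype=np.float64)
    voiced = f0 > 0
    # Unvoiced samples (f0 == 0) keep the fixed dilation (scale of 1).
    scale = np.ones_like(f0)
    # Voiced samples stretch the dilation in proportion to the pitch period.
    scale[voiced] = np.maximum(1.0, np.round(sample_rate / (f0[voiced] * dense_factor)))
    return (scale * base_dilation).astype(np.int64)


if __name__ == "__main__":
    contour = [100.0, 200.0, 0.0]  # Hz; 0 marks an unvoiced sample
    print(pitch_dependent_dilation(contour, sample_rate=16000))
```

With a fixed dilation, the receptive field covers the same number of samples regardless of pitch; in this sketch a lower F0 yields a larger dilation, so the effective receptive field follows the pitch period.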
Conclusions
In this paper, we investigated the performance of the QPNet vocoder in generating speaker-converted speech, compared with the full- and compact-sized WN vocoders and the traditional WORLD vocoder. The inputs to each vocoder are the spectral features converted by a framewise DNN-VC model and the linearly transformed prosodic features. We also evaluated the effectiveness of two speaker adaptation methods for speaker-dependent (SD) WN-based vocoders. Both objective and subjective evaluations confirmed the effectiveness of the speaker adaptation technique and of the QPNet vocoder, which takes advantage of pitch-dependent dilated convolution to attain better pitch controllability and to achieve quality comparable to that of the WN vocoder with only half the network size. In future work, we will survey different combinations of pitch-dependent and fixed dilated convolutions to optimize performance.
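The exact form of the linear prosodic transformation is not given here. A common choice in voice conversion is a mean-variance transformation of log-F0, and the minimal sketch below assumes that form. The function name `convert_f0_linear` and its parameters are hypothetical, and the statistics are assumed to be the mean and standard deviation of log-F0 over the voiced frames of each speaker.

```python
import numpy as np


def convert_f0_linear(f0_src, src_mean, src_std, tgt_mean, tgt_std):
    """Map a source F0 contour (Hz) to the target speaker's log-F0 statistics.

    Assumption for this sketch: all four statistics are computed on log-F0
    of voiced frames; unvoiced frames (f0 == 0) are passed through as 0.
    """
    f0 = np.asarray(f0_src, dtype=np.float64)
    converted = np.zeros_like(f0)
    voiced = f0 > 0
    log_f0 = np.log(f0[voiced])
    # Normalize with the source statistics, then rescale with the target ones.
    converted[voiced] = np.exp((log_f0 - src_mean) / src_std * tgt_std + tgt_mean)
    return converted


if __name__ == "__main__":
    f0 = np.array([0.0, 100.0, 120.0])
    print(convert_f0_linear(f0, np.log(110.0), 0.2, np.log(220.0), 0.25))
```

Working in the log domain keeps the converted F0 positive and treats pitch differences as ratios rather than absolute offsets.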