High quality voice conversion using prosodic and high-resolution spectral features

Voice conversion methods have advanced rapidly over the last decade. Studies have shown that speaker characteristics are captured by spectral feature as well as various prosodic features. Most existing conversion methods focus on the spectral feature as it directly represents the timbre characteristics, while some conversion methods have focused only on the prosodic feature represented by the fundamental frequency. In this paper, a comprehensive framework using deep neural networks to convert both timbre and prosodic features is proposed. The timbre feature is represented by a high-resolution ...

Reducing one-to-many problem in Voice Conversion by equalizing the formant locations using dynamic frequency warping

In this study, we investigate a solution to reduce the effect of one-to-many problem in voice conversion. One-to-many problem in VC happens when two very similar speech segments in source speaker have corresponding speech segments in target speaker that are not similar to each other. As a result, the mapper function usually over-smoothes the generated features in order to be similar to both target speech segments. In this study, we propose to equalize the formant location of source-target frame pairs using dynamic frequency warping in order to reduce the complexity. After the conversion, anoth...