Error Reduction Network for DBLSTM-based Voice Conversion

So far, many of the deep learning approaches for voice conversion produce good quality speech by using a large amount of training data. This paper presents a Deep Bidirectional Long Short-Term Memory (DBLSTM) based voice conversion framework that can work with a limited amount of training data. We propose to implement a DBLSTM based average model that is trained with data from many speakers. Then, we propose to perform adaptation with a limited amount of target data. Last but not least, we propose an error reduction network that can improve the voice conversion quality even further. The propo...

High-quality nonparallel voice conversion based on cycle-consistent adversarial network

Voice Conversion

Tags: 2018 | Fuming Fang | Isao Echizen | Jaime Lorenzo-Trueba | Junichi Yamagishi

Although voice conversion (VC) algorithms have achieved remarkable success along with the development of machine learning, superior performance is still difficult to achieve when using nonparallel data. In this paper, we propose using a cycle-consistent adversarial network (CycleGAN) for nonparallel data-based VC training. A CycleGAN is a generative adversarial network (GAN) originally developed for unpaired image-to-image translation. A subjective evaluation of inter-gender conversion demonstrated that the proposed method significantly outperformed a method based on the Merlin open source neu...
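
The cycle-consistency idea behind a CycleGAN can be illustrated with a small sketch. This is not the authors' implementation: the toy "generators" (`shift_up`, `shift_down`, `shift_down_too_far`), the frame values, and the choice of an L1 distance are all assumptions made for illustration, and the adversarial losses that a full CycleGAN also uses are omitted. The sketch only shows that the loss needs no pairing between source and target frames, which is what makes nonparallel training possible.

```python
from typing import Callable, List, Sequence

Frame = List[float]
Mapping = Callable[[Frame], Frame]


def l1_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean absolute difference between two equal-length feature frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def cycle_consistency_loss(
    source: Sequence[Frame],
    target: Sequence[Frame],
    src_to_tgt: Mapping,
    tgt_to_src: Mapping,
) -> float:
    """Penalize frames that do not survive a round trip through both mappings.

    The source and target sets are processed independently, so they do not
    have to contain the same utterances or the same number of frames.
    """
    forward = sum(l1_distance(tgt_to_src(src_to_tgt(x)), x) for x in source)
    backward = sum(l1_distance(src_to_tgt(tgt_to_src(y)), y) for y in target)
    return forward / len(source) + backward / len(target)


# Hypothetical stand-ins for the two generator networks.
def shift_up(frame: Frame) -> Frame:
    return [v + 1.0 for v in frame]


def shift_down(frame: Frame) -> Frame:
    return [v - 1.0 for v in frame]


def shift_down_too_far(frame: Frame) -> Frame:
    return [v - 1.5 for v in frame]


source_frames = [[0.0, 0.5], [2.0, 4.0]]
target_frames = [[1.0, 1.5], [3.0, 5.0]]

# Mutually inverse mappings give a zero cycle loss.
consistent_loss = cycle_consistency_loss(
    source_frames, target_frames, shift_up, shift_down
)
# A backward mapping that overshoots is penalized on both round trips.
inconsistent_loss = cycle_consistency_loss(
    source_frames, target_frames, shift_up, shift_down_too_far
)
```

In this toy setting `consistent_loss` is zero, while `inconsistent_loss` is positive because every round trip ends 0.5 away from its starting frame.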

Voice Conversion Using Sequence-to-Sequence Learning of Context Posterior Probabilities

Voice Conversion

Categories: Dynamic Time Warping | Fourier Transform | Mel-spectrogram | Recurrent Neural Network

Tags: 2017 | Hiroshi Saruwatari | Hiroyuki Miyoshi | Shinnosuke Takamichi | Yuki Saito

Voice conversion (VC) using sequence-to-sequence learning of context posterior probabilities is proposed. Conventional VC using shared context posterior probabilities predicts target speech parameters from the context posterior probabilities estimated from the source speech parameters. Although conventional VC can be built from non-parallel data, it is difficult to convert speaker individuality such as phonetic property and speaking rate contained in the posterior probabilities because the source posterior probabilities are directly used for predicting target speech parameters. In this work, w...

Dictionary Update for NMF-based Voice Conversion Using an Encoder-Decoder Network

Voice Conversion

Categories: Autoencoder | Dynamic Time Warping | Fourier Transform | Mel-spectrogram

In this paper, we propose a dictionary update method for Nonnegative Matrix Factorization (NMF) with high-dimensional data in a spectral conversion (SC) task. Voice conversion has been widely studied due to its potential applications such as personalized speech synthesis and speech enhancement. Exemplar-based NMF (ENMF) emerges as an effective and probably the simplest choice among all techniques for SC, as long as a source-target parallel speech corpus is given. ENMF-based SC systems usually need a large number of bases (exemplars) to ensure the quality of the converted speech. However, a sma...
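
As background for the exemplar-based setting mentioned above, the following is a minimal sketch of the commonly described conversion step: activations are estimated against a fixed source dictionary and then reused with the paired target dictionary. It does not implement the dictionary update method this paper proposes. The dictionaries, the frame values, and the use of Euclidean multiplicative updates are assumptions chosen to keep the example small.

```python
from typing import List, Sequence

Vector = List[float]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def reconstruct(dictionary: Sequence[Vector], activations: Sequence[float]) -> Vector:
    """Weighted sum of exemplars: one nonnegative weight per exemplar."""
    dim = len(dictionary[0])
    return [
        sum(h * exemplar[d] for h, exemplar in zip(activations, dictionary))
        for d in range(dim)
    ]


def estimate_activations(
    frame: Sequence[float],
    dictionary: Sequence[Vector],
    iterations: int = 50,
    eps: float = 1e-12,
) -> Vector:
    """Multiplicative updates for nonnegative activations, dictionary held fixed.

    Each update multiplies h_k by (w_k . x) / (w_k . Wh), which keeps the
    activations nonnegative while reducing the squared reconstruction error.
    """
    h = [1.0] * len(dictionary)
    for _ in range(iterations):
        recon = reconstruct(dictionary, h)
        h = [
            h_k * dot(exemplar, frame) / (dot(exemplar, recon) + eps)
            for h_k, exemplar in zip(h, dictionary)
        ]
    return h


# Hypothetical paired dictionaries: exemplar k of the source corresponds to
# exemplar k of the target (this pairing is what a parallel corpus provides).
source_dictionary = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
target_dictionary = [[0.5, 0.5, 0.0], [0.0, 0.0, 1.0]]

source_frame = [2.0, 3.0, 3.0]
activations = estimate_activations(source_frame, source_dictionary)

# Conversion: reuse the source activations with the target exemplars.
converted_frame = reconstruct(target_dictionary, activations)
```

With real spectra the dictionaries hold many exemplars per speaker, which is the cost the abstract refers to when it says a large number of bases is needed.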

High quality voice conversion using prosodic and high-resolution spectral features

Voice Conversion

Categories: Autoencoder | Deep Learning | Dynamic Time Warping | Fourier Transform

Voice conversion methods have advanced rapidly over the last decade. Studies have shown that speaker characteristics are captured by spectral feature as well as various prosodic features. Most existing conversion methods focus on the spectral feature as it directly represents the timbre characteristics, while some conversion methods have focused only on the prosodic feature represented by the fundamental frequency. In this paper, a comprehensive framework using deep neural networks to convert both timbre and prosodic features is proposed. The timbre feature is represented by a high-resolution ...

Reducing one-to-many problem in Voice Conversion by equalizing the formant locations using dynamic frequency warping

Voice Conversion

Categories: Dynamic Time Warping | Fourier Transform | Gaussian Mixture Model | Mel-spectrogram

Tags: 2015 | Seyed Hamidreza Mohammadi

In this study, we investigate a solution to reduce the effect of the one-to-many problem in voice conversion (VC). The one-to-many problem in VC occurs when two very similar speech segments from the source speaker correspond to target-speaker segments that are not similar to each other. As a result, the mapper function usually over-smooths the generated features so that they resemble both target speech segments. In this study, we propose to equalize the formant locations of source-target frame pairs using dynamic frequency warping in order to reduce the complexity. After the conversion, anoth...
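
The over-smoothing effect described above can be seen in a toy example. The sketch below is only an illustration and does not reproduce the paper's warping method: it assumes a mapper trained with a squared-error criterion, two hypothetical four-bin target spectra, and an identical source frame for both. Under a squared-error criterion, the best single output for one input is the mean of its targets, which flattens the spectral peaks.

```python
from typing import List, Sequence

Vector = List[float]


def mse_optimal_output(targets: Sequence[Vector]) -> Vector:
    """Return the output that minimizes summed squared error to all targets.

    For a deterministic mapper that sees the same input for every target,
    this minimizer is the per-dimension mean of the targets.
    """
    count = len(targets)
    dim = len(targets[0])
    return [sum(t[d] for t in targets) / count for d in range(dim)]


# Hypothetical target frames paired with (nearly) the same source frame:
# one has its spectral peak in the lowest bin, the other in the highest bin.
target_a = [1.0, 0.0, 0.0, 0.0]
target_b = [0.0, 0.0, 0.0, 1.0]

prediction = mse_optimal_output([target_a, target_b])

# The prediction has two half-height peaks, so it matches neither target.
peak_height_lost = max(target_a) - max(prediction)
```

Here `prediction` keeps both peak locations at half their original height. Making the paired target frames more alike before training, which is the stated aim of equalizing formant locations, would reduce this averaging.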

Category: Dynamic Time Warping