Spectrum and Prosody Conversion for Cross-lingual Voice Conversion with CycleGAN

Tags: 2020 | Berrak Sisman | Haizhou Li | Kun Zhou | Zongyang Du

Cross-lingual voice conversion aims to change a source speaker's voice to sound like that of a target speaker when the source and target speakers speak different languages. It relies on non-parallel training data from two different languages and is hence more challenging than mono-lingual voice conversion. Previous studies on cross-lingual voice conversion mainly focus on spectral conversion, with a linear transformation for F0 transfer. However, as an important prosodic factor, F0 is inherently hierarchical, so a linear method alone is insufficient for its conversion. We propose the use of conti...

VAW-GAN for Singing Voice Conversion with Non-parallel Training Data

Voice Conversion

Tags: 2020 | Berrak Sisman | Haizhou Li | Junchen Lu | Kun Zhou

Singing voice conversion aims to convert a singer's voice from source to target without changing the singing content. Parallel training data is typically required to train a singing voice conversion system, which is, however, not practical in real-life applications. Recent encoder-decoder structures, such as the variational autoencoding Wasserstein generative adversarial network (VAW-GAN), provide an effective way to learn a mapping from non-parallel training data. In this paper, we propose a singing voice conversion framework that is based on VAW-GAN. We train an encoder to disentangle singer ...

An Overview of Voice Conversion and its Challenges: From Statistical Modeling to Deep Learning

Voice Conversion

Tags: 2020 | Berrak Sisman | Haizhou Li | Junichi Yamagishi | Simon King

Speaker identity is one of the important characteristics of human speech. In voice conversion, we change the speaker identity from one to another, while keeping the linguistic content unchanged. Voice conversion involves multiple speech processing techniques, such as speech analysis, spectral conversion, prosody conversion, speaker characterization, and vocoding. With the recent advances in theory and practice, we are now able to produce human-like voice quality with high speaker similarity. In this paper, we provide a comprehensive overview of the state-of-the-art of voice conversion techniqu...

Converting Anyone's Emotion: Towards Speaker-Independent Emotional Voice Conversion

Voice Conversion

Tags: 2020 | Berrak Sisman | Haizhou Li | Kun Zhou | Mingyang Zhang

Emotional voice conversion aims to convert the emotion of speech from one state to another while preserving the linguistic content and speaker identity. Prior studies on emotional voice conversion are mostly carried out under the assumption that emotion is speaker-dependent. We believe that emotions are expressed universally across speakers; therefore, a speaker-independent mapping between emotional states of speech is possible. In this paper, we propose to build a speaker-independent emotional voice conversion framework that can convert anyone's emotion without the need for paralle...

Transforming Spectrum and Prosody for Emotional Voice Conversion with Non-Parallel Training Data

Voice Conversion

Categories: Adversarial Training | Deep Learning | Generative Adversarial Network

Tags: 2020 | Berrak Sisman | Haizhou Li | Kun Zhou

Emotional voice conversion converts the spectrum and prosody to change the emotional patterns of speech while preserving the speaker identity and linguistic content. Many studies require parallel speech data between different emotional patterns, which is not practical in real life. Moreover, they often model the conversion of fundamental frequency (F0) with a simple linear transform. As F0 is a key aspect of intonation that is hierarchical in nature, we believe that it is more adequate to model F0 at different temporal scales using the wavelet transform. We propose a CycleGAN network to f...

Error Reduction Network for DBLSTM-based Voice Conversion

Voice Conversion

So far, many deep learning approaches for voice conversion produce good-quality speech by using a large amount of training data. This paper presents a Deep Bidirectional Long Short-Term Memory (DBLSTM)-based voice conversion framework that can work with a limited amount of training data. We propose to implement a DBLSTM-based average model that is trained with data from many speakers. Then, we propose to perform adaptation with a limited amount of target data. Last but not least, we propose an error reduction network that can improve the voice conversion quality even further. The propo...

Tag: Berrak Sisman