Spectrum and Prosody Conversion for Cross-lingual Voice Conversion with CycleGAN

Cross-lingual voice conversion aims to change the source speaker's voice to sound like that of the target speaker when the source and target speakers speak different languages. It relies on non-parallel training data from two different languages and is therefore more challenging than mono-lingual voice conversion. Previous studies on cross-lingual voice conversion mainly focus on spectral conversion with a linear transformation for F0 transfer. However, as an important prosodic factor, F0 is inherently hierarchical, so a linear method alone is insufficient for its conversion. We propose the use of conti...
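The linear F0 transfer that this abstract contrasts with hierarchical modelling is conventionally done in the log-F0 domain by matching the source speaker's statistics to the target's. A minimal sketch, assuming a log-Gaussian normalization and an illustrative function name (neither taken from the paper):

```python
import math

def linear_f0_transfer(f0_src, src_mean, src_std, tgt_mean, tgt_std):
    """Map a source F0 contour onto the target speaker's log-F0 statistics.

    src_mean/src_std and tgt_mean/tgt_std are the mean and standard
    deviation of log F0 over voiced frames for each speaker. Unvoiced
    frames (f0 == 0) are passed through unchanged.
    """
    out = []
    for f0 in f0_src:
        if f0 <= 0:  # unvoiced frame: keep as-is
            out.append(0.0)
            continue
        lf0 = math.log(f0)
        # z-score under the source statistics, re-scaled to the target's
        out.append(math.exp((lf0 - src_mean) / src_std * tgt_std + tgt_mean))
    return out
```

Because the transform is a single global shift and scale in the log domain, it cannot reshape contour details at different temporal scales, which is the limitation the abstract points at.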

VAW-GAN for Singing Voice Conversion with Non-parallel Training Data

Singing voice conversion aims to convert a singer's voice from source to target without changing the singing content. Parallel training data is typically required to train a singing voice conversion system, which is, however, not practical in real-life applications. Recent encoder-decoder structures, such as the variational autoencoding Wasserstein generative adversarial network (VAW-GAN), provide an effective way to learn a mapping from non-parallel training data. In this paper, we propose a singing voice conversion framework based on VAW-GAN. We train an encoder to disentangle singer ...

An Overview of Voice Conversion and its Challenges: From Statistical Modeling to Deep Learning

Speaker identity is one of the important characteristics of human speech. In voice conversion, we change the speaker identity from one to another, while keeping the linguistic content unchanged. Voice conversion involves multiple speech processing techniques, such as speech analysis, spectral conversion, prosody conversion, speaker characterization, and vocoding. With the recent advances in theory and practice, we are now able to produce human-like voice quality with high speaker similarity. In this paper, we provide a comprehensive overview of the state-of-the-art of voice conversion techniqu...

Converting Anyone's Emotion: Towards Speaker-Independent Emotional Voice Conversion

Emotional voice conversion aims to convert the emotion of speech from one state to another while preserving the linguistic content and speaker identity. Prior studies on emotional voice conversion are mostly carried out under the assumption that emotion is speaker-dependent. We believe that emotions are expressed universally across speakers; therefore, a speaker-independent mapping between emotional states of speech is possible. In this paper, we propose to build a speaker-independent emotional voice conversion framework that can convert anyone's emotion without the need for paralle...

Transforming Spectrum and Prosody for Emotional Voice Conversion with Non-Parallel Training Data

Emotional voice conversion converts the spectrum and prosody of speech to change its emotional pattern, while preserving the speaker identity and linguistic content. Many studies require parallel speech data between different emotional patterns, which is not practical in real life. Moreover, they often model the conversion of fundamental frequency (F0) with a simple linear transform. As F0 is a key aspect of intonation that is hierarchical in nature, we believe it is more adequate to model F0 at different temporal scales using the wavelet transform. We propose a CycleGAN network to f...
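A continuous wavelet transform over an interpolated log-F0 contour is one way to expose the multiple temporal scales the abstract refers to. The sketch below hand-rolls a Ricker (Mexican-hat) CWT with NumPy; the kernel choice, scale set, and function names are illustrative assumptions rather than the paper's exact configuration:

```python
import numpy as np

def ricker(points, a):
    # Ricker (Mexican-hat) wavelet of width `a`, sampled at `points` positions.
    x = np.arange(points) - (points - 1) / 2.0
    amp = 2.0 / (np.sqrt(3.0 * a) * np.pi ** 0.25)
    return amp * (1 - (x / a) ** 2) * np.exp(-(x ** 2) / (2 * a ** 2))

def cwt_f0(logf0, scales):
    """Decompose an interpolated log-F0 contour into temporal scales.

    Small scales capture frame-level micro-prosody; large scales capture
    phrase-level trends. Returns an array of shape (len(scales), n_frames).
    """
    logf0 = np.asarray(logf0, dtype=float)
    coeffs = np.empty((len(scales), logf0.size))
    for i, a in enumerate(scales):
        kernel = ricker(min(10 * a, logf0.size), a)
        coeffs[i] = np.convolve(logf0, kernel, mode="same")
    return coeffs
```

Each row of the output can then be converted separately, so a model can treat phrase-level intonation differently from local F0 movement instead of applying one global linear transform.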

Black-box Attacks on Automatic Speaker Verification using Feedback-controlled Voice Conversion

Automatic speaker verification (ASV) systems in practice are highly vulnerable to spoofing attacks. The latest voice conversion technologies are able to produce perceptually natural-sounding speech that mimics any target speaker. However, perceptual closeness to a speaker's identity may not be enough to deceive an ASV system. In this work, we propose a framework that uses the output scores of an ASV system as feedback to a voice conversion system. The attacker framework is a black-box adversary that steals one's voice identity, because it does not require any knowledge about the ASV ...
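The score-feedback idea can be illustrated as a simple black-box search loop. Here `convert` and `asv_score` are hypothetical callables standing in for the voice conversion system and the ASV system's scoring interface, and the exhaustive search over candidate settings is a deliberate simplification of whatever optimization the attacker actually runs:

```python
def feedback_attack(convert, asv_score, candidates):
    """Pick the conversion setting that maximizes the ASV score.

    Only the scalar score is observed, so no knowledge of the ASV
    internals is needed -- this is what makes the attack black-box.
    """
    best, best_score = None, float("-inf")
    for params in candidates:
        utterance = convert(params)      # synthesize a spoofed utterance
        score = asv_score(utterance)     # query the target ASV system
        if score > best_score:
            best, best_score = params, score
    return best, best_score
```

A toy run with stand-in callables: `feedback_attack(lambda p: 2 * p, lambda x: -abs(x - 6), [1, 2, 3, 4])` selects the candidate whose converted output scores highest.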

A Modularized Neural Network with Language-Specific Output Layers for Cross-lingual Voice Conversion

This paper presents a cross-lingual voice conversion framework that adopts a modularized neural network. The modularized neural network has a common input structure that is shared across both languages and two separate output modules, one for each language. The idea is motivated by the fact that the phonetic systems of languages are similar, because humans share a common vocal production system, whereas acoustic renderings, such as prosody and phonotactics, vary considerably from language to language. The modularized neural network is trained to map Phonetic PosteriorGrams (PPGs) to acoustic features for multiple ...
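A toy forward pass with one shared input module and per-language output heads illustrates the modularization; the plain linear layers and all dimensions below are illustrative assumptions, not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

class ModularizedVC:
    """Sketch: shared input module + one output module per language."""

    def __init__(self, ppg_dim, hidden, acoustic_dim, languages):
        # Language-independent input module, shared by all languages.
        self.shared_w = rng.standard_normal((ppg_dim, hidden)) * 0.1
        # One language-specific output layer per language.
        self.heads = {lang: rng.standard_normal((hidden, acoustic_dim)) * 0.1
                      for lang in languages}

    def forward(self, ppg, lang):
        h = np.tanh(ppg @ self.shared_w)   # shared phonetic representation
        return h @ self.heads[lang]        # language-specific rendering

model = ModularizedVC(ppg_dim=144, hidden=32, acoustic_dim=40,
                      languages=["en", "zh"])
ppg = rng.standard_normal((10, 144))       # 10 frames of PPGs
feats_en = model.forward(ppg, "en")        # (10, 40) acoustic features
feats_zh = model.forward(ppg, "zh")
```

Routing the same PPG input through different heads yields different acoustic renderings, mirroring the shared-phonetics / language-specific-acoustics split the abstract describes.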

A Vocoder-free WaveNet Voice Conversion with Non-Parallel Data

In a typical voice conversion system, a vocoder is commonly used for speech-to-feature analysis and feature-to-speech synthesis. However, the vocoder can be a source of speech quality degradation. This paper presents a vocoder-free voice conversion approach that uses WaveNet with non-parallel training data. Instead of dealing with intermediate features, the proposed approach uses WaveNet to map Phonetic PosteriorGrams (PPGs) directly to waveform samples. In this way, we avoid the estimation errors caused by the vocoder and feature conversion. Additionally, as PPG is assumed to be speake...

Joint training framework for text-to-speech and voice conversion using multi-source Tacotron and WaveNet

We investigated the training of a shared model for both text-to-speech (TTS) and voice conversion (VC) tasks. We propose using an extended Tacotron architecture, a multi-source sequence-to-sequence model with a dual attention mechanism, as the shared model for both the TTS and VC tasks. The model accomplishes the two tasks according to the type of input: an end-to-end speech synthesis task is conducted when it is given text as the input, while a sequence-to-sequence voice conversion task is conducted when it is given the speech of a source spea...

Error Reduction Network for DBLSTM-based Voice Conversion

So far, many deep learning approaches to voice conversion produce good-quality speech by using a large amount of training data. This paper presents a Deep Bidirectional Long Short-Term Memory (DBLSTM) based voice conversion framework that can work with a limited amount of training data. We propose to implement a DBLSTM-based average model that is trained with data from many speakers. Then, we perform adaptation with a limited amount of target data. Last but not least, we propose an error reduction network that improves the voice conversion quality even further. The propo...