Nonparallel Voice Conversion with Augmented Classifier Star Generative Adversarial Networks

Tags: 2020 | Hirokazu Kameoka | Kou Tanaka | Nobukatsu Hojo | Takuhiro Kaneko

We have previously proposed a method that allows for non-parallel voice conversion (VC) by using a variant of generative adversarial networks (GANs) called StarGAN. The main features of our method, called StarGAN-VC, are as follows: First, it requires no parallel utterances, transcriptions, or time alignment procedures for speech generator training. Second, it can simultaneously learn mappings across multiple domains using a single generator network so that it can fully exploit available training data collected from multiple domains to capture latent features that are common to all the domains...

StarGAN-VC2: Rethinking Conditional Methods for StarGAN-Based Voice Conversion

Voice Conversion

Categories: Adversarial Training | Deep Learning | Generative Adversarial Network

Tags: 2019 | Hirokazu Kameoka | Kou Tanaka | Nobukatsu Hojo | Takuhiro Kaneko

Non-parallel multi-domain voice conversion (VC) is a technique for learning mappings among multiple domains without relying on parallel data. This is important but challenging owing to the requirement of learning multiple mappings and the non-availability of explicit supervision. Recently, StarGAN-VC has garnered attention owing to its ability to solve this problem using only a single generator. However, there is still a gap between real and converted speech. To bridge this gap, we rethink conditional methods of StarGAN-VC, which are key components for achieving non-parallel multi-domain VC in...

Crossmodal Voice Conversion

Voice Conversion

Humans are able to imagine a person's voice from the person's appearance and imagine the person's appearance from his/her voice. In this paper, we make the first attempt to develop a method that can convert speech into a voice that matches an input face image and generate a face image that matches the voice of the input speech by leveraging the correlation between faces and voices. We propose a model, consisting of a speech converter, a face encoder/decoder and a voice encoder. We use the latent code of an input face image encoded by the face encoder as the auxiliary input into the speech conv...

CycleGAN-VC2: Improved CycleGAN-based Non-parallel Voice Conversion

Voice Conversion

Tags: 2019 | Hirokazu Kameoka | Kou Tanaka | Nobukatsu Hojo | Takuhiro Kaneko

Non-parallel voice conversion (VC) is a technique for learning the mapping from source to target speech without relying on parallel data. This is an important task, but it has been challenging due to the disadvantages of the training conditions. Recently, CycleGAN-VC has provided a breakthrough and performed comparably to a parallel VC method without relying on any extra data, modules, or time alignment procedures. However, there is still a large gap between the real target and converted speech, and bridging this gap remains a challenge. To reduce this gap, we propose CycleGAN-VC2, which is an...

ConvS2S-VC: Fully convolutional sequence-to-sequence voice conversion

Voice Conversion

Categories: Convolutional Neural Network | Fourier Transform | Mel-spectrogram

Tags: 2018 | Hirokazu Kameoka | Kou Tanaka | Nobukatsu Hojo | Takuhiro Kaneko

This paper proposes a voice conversion method based on fully convolutional sequence-to-sequence (seq2seq) learning. The present method, which we call "ConvS2S-VC", learns the mapping between source and target speech feature sequences using a fully convolutional seq2seq model with an attention mechanism. Owing to the nature of seq2seq learning, our method is particularly noteworthy in that it allows the flexible conversion of not only the voice characteristics but also the pitch contour and duration of the input speech. The current model consists of six networks, namely source and target encode...

AttS2S-VC: Sequence-to-Sequence Voice Conversion with Attention and Context Preservation Mechanisms

Voice Conversion

Categories: Deep Learning | Fourier Transform | Gaussian Mixture Model | Mel-spectrogram

Tags: 2018 | Hirokazu Kameoka | Kou Tanaka | Nobukatsu Hojo | Takuhiro Kaneko

This paper describes a method for voice conversion (VC) tasks based on sequence-to-sequence (Seq2Seq) learning with attention and context preservation mechanisms. Seq2Seq has performed well on numerous sequence modeling tasks, such as speech synthesis and recognition, machine translation, and image captioning. In contrast to current VC techniques, our method 1) stabilizes and accelerates the training procedure by using guided attention and the proposed context preservation losses, 2) allows not only spectral envelopes but also fundamental frequency contours and durations of speec...

ACVAE-VC: Non-parallel many-to-many voice conversion with auxiliary classifier variational autoencoder

Voice Conversion

Categories: Autoencoder | Convolutional Neural Network | Deep Learning | Recurrent Neural Network

Tags: 2018 | Hirokazu Kameoka | Kou Tanaka | Nobukatsu Hojo | Takuhiro Kaneko

This paper proposes a non-parallel many-to-many voice conversion (VC) method using a variant of the conditional variational autoencoder (VAE) called an auxiliary classifier VAE (ACVAE). The proposed method has three key features. First, it adopts fully convolutional architectures to construct the encoder and decoder networks so that the networks can learn conversion rules that capture time dependencies in the acoustic feature sequences of source and target speech. Second, it uses an information-theoretic regularization for the model training to ensure that the information in the attribute clas...

StarGAN-VC: Non-parallel many-to-many voice conversion with star generative adversarial networks

Voice Conversion

Categories: Adversarial Training | Deep Learning | Fourier Transform | Generative Adversarial Network | Mel-spectrogram

Tags: 2018 | Hirokazu Kameoka | Kou Tanaka | Nobukatsu Hojo | Takuhiro Kaneko

This paper proposes a method that allows non-parallel many-to-many voice conversion (VC) by using a variant of a generative adversarial network (GAN) called StarGAN. Our method, which we call StarGAN-VC, is noteworthy in that it (1) requires no parallel utterances, transcriptions, or time alignment procedures for speech generator training, (2) simultaneously learns many-to-many mappings across different attribute domains using a single generator network, (3) is able to generate converted speech signals quickly enough to allow real-time implementations and (4) requires only several minutes of t...
