Joint training framework for text-to-speech and voice conversion using multi-source Tacotron and WaveNet

Categories: Fourier Transform | Mean Opinion Score | Mel-spectrogram | Recurrent Neural Network | WaveNet

We investigated the training of a shared model for both text-to-speech (TTS) and voice conversion (VC) tasks. We propose using an extended Tacotron architecture, namely a multi-source sequence-to-sequence model with a dual attention mechanism, as the shared model for both the TTS and VC tasks. This model can accomplish either task according to the type of input. An end-to-end speech synthesis task is conducted when the model is given text as the input, while a sequence-to-sequence voice conversion task is conducted when it is given the speech of a source spea...

Singing voice conversion with non-parallel data

Voice Conversion

Categories: Deep Learning | Fourier Transform | Mean Opinion Score | Mel-spectrogram | Recurrent Neural Network

Tags: 2019 | Jinxi Guo | Ning Xu | Wei Chu | Xin Chen

Singing voice conversion is the task of converting a song sung by a source singer to the voice of a target singer. In this paper, we propose using a parallel-data-free, many-to-one voice conversion technique on singing voices. A phonetic posterior feature is first generated by decoding singing voices through a robust Automatic Speech Recognition (ASR) engine. Then, a trained Recurrent Neural Network (RNN) with a Deep Bidirectional Long Short-Term Memory (DBLSTM) structure is used to model the mapping from person-independent content to the acoustic features of the target person. F0 and aperiodic are...

ConvS2S-VC: Fully convolutional sequence-to-sequence voice conversion

Voice Conversion

Categories: Convolutional Neural Network | Fourier Transform | Mel-spectrogram

Tags: 2018 | Hirokazu Kameoka | Kou Tanaka | Nobukatsu Hojo | Takuhiro Kaneko

This paper proposes a voice conversion method based on fully convolutional sequence-to-sequence (seq2seq) learning. The present method, which we call "ConvS2S-VC", learns the mapping between source and target speech feature sequences using a fully convolutional seq2seq model with an attention mechanism. Owing to the nature of seq2seq learning, our method is particularly noteworthy in that it allows the flexible conversion of not only the voice characteristics but also the pitch contour and duration of the input speech. The current model consists of six networks, namely source and target encode...

AttS2S-VC: Sequence-to-Sequence Voice Conversion with Attention and Context Preservation Mechanisms

Voice Conversion

Categories: Deep Learning | Fourier Transform | Gaussian Mixture Model | Mel-spectrogram

Tags: 2018 | Hirokazu Kameoka | Kou Tanaka | Nobukatsu Hojo | Takuhiro Kaneko

This paper describes a method based on sequence-to-sequence (Seq2Seq) learning with attention and context preservation mechanisms for voice conversion (VC) tasks. Seq2Seq has been outstanding at numerous tasks involving sequence modeling such as speech synthesis and recognition, machine translation, and image captioning. In contrast to current VC techniques, our method 1) stabilizes and accelerates the training procedure by considering guided attention and proposed context preservation losses, 2) allows not only spectral envelopes but also fundamental frequency contours and durations of speec...

Whispered-to-voiced Alaryngeal Speech Conversion with Generative Adversarial Networks

Voice Conversion

Categories: Adversarial Training | Deep Learning | Generative Adversarial Network

Tags: 2018 | Antonio Bonafonte | Joan Serrà | Jose A. Gonzalez | Santiago Pascual

Most methods of voice restoration for patients suffering from aphonia either produce whispered or monotone speech. Apart from intelligibility, this type of speech lacks expressiveness and naturalness due to the absence of pitch (whispered speech) or artificial generation of it (monotone speech). Existing techniques to restore prosodic information typically combine a vocoder, which parameterises the speech signal, with machine learning techniques that predict prosodic information. In contrast, this paper describes an end-to-end neural approach for estimating a fully-voiced speech waveform from ...

Error Reduction Network for DBLSTM-based Voice Conversion

Voice Conversion

So far, many of the deep learning approaches for voice conversion produce good quality speech by using a large amount of training data. This paper presents a Deep Bidirectional Long Short-Term Memory (DBLSTM) based voice conversion framework that can work with a limited amount of training data. We propose to implement a DBLSTM based average model that is trained with data from many speakers. Then, we propose to perform adaptation with a limited amount of target data. Last but not least, we propose an error reduction network that can improve the voice conversion quality even further. The propo...

A Spoofing Benchmark for the 2018 Voice Conversion Challenge: Leveraging from Spoofing Countermeasures for Speech Artifact Assessment

Voice Conversion

Categories: Gaussian Mixture Model

Voice conversion (VC) aims at converting speaker characteristics without altering content. Due to training data limitations and modeling imperfections, it is difficult to achieve believable speaker mimicry without introducing processing artifacts; performance assessment of VC, therefore, usually involves both speaker similarity and quality evaluation by a human panel. As a time-consuming, expensive, and non-reproducible process, it hinders rapid prototyping of new VC technology. We address artifact assessment using an alternative, objective approach leveraging prior work on spoofing cou...

Voice Conversion Based on Cross-Domain Features Using Variational Auto Encoders

Voice Conversion

Categories: Autoencoder | Deep Learning

An effective approach to non-parallel voice conversion (VC) is to utilize deep neural networks (DNNs), specifically variational auto encoders (VAEs), to model the latent structure of speech in an unsupervised manner. A previous study has confirmed the effectiveness of a VAE using the STRAIGHT spectra for VC. However, VAEs using other types of spectral features, such as mel-cepstral coefficients (MCCs), which are related to human perception and have been widely used in VC, have not been properly investigated. Instead of using one specific type of spectral feature, it is expected that VAE may benefit...

ACVAE-VC: Non-parallel many-to-many voice conversion with auxiliary classifier variational autoencoder

Voice Conversion

Categories: Autoencoder | Convolutional Neural Network | Deep Learning | Recurrent Neural Network

Tags: 2018 | Hirokazu Kameoka | Kou Tanaka | Nobukatsu Hojo | Takuhiro Kaneko

This paper proposes a non-parallel many-to-many voice conversion (VC) method using a variant of the conditional variational autoencoder (VAE) called an auxiliary classifier VAE (ACVAE). The proposed method has three key features. First, it adopts fully convolutional architectures to construct the encoder and decoder networks so that the networks can learn conversion rules that capture time dependencies in the acoustic feature sequences of source and target speech. Second, it uses an information-theoretic regularization for the model training to ensure that the information in the attribute clas...

Voice Conversion with Conditional SampleRNN

Voice Conversion

Categories: Deep Learning | Recurrent Neural Network

Here we present a novel approach to conditioning the SampleRNN generative model for voice conversion (VC). Conventional methods for VC modify the perceived speaker identity by converting between source and target acoustic features. Our approach focuses on preserving voice content and depends on the generative network to learn voice style. We first train a multi-speaker SampleRNN model conditioned on linguistic features, pitch contour, and speaker identity using a multi-speaker speech corpus. Voice-converted speech is generated using linguistic features and pitch contour extracted from the sour...
