On Using Backpropagation for Speech Texture Generation and Voice Conversion

Categories: Convolutional Neural Network | Deep Learning | Fourier Transform | Mel-spectrogram

Tags: 2018 | Jan Chorowski | Rif A. Saurous | Ron J. Weiss | Samy Bengio

Inspired by recent work on neural network image generation which relies on backpropagation towards the network inputs, we present a proof-of-concept system for speech texture synthesis and voice conversion based on two mechanisms: approximate inversion of the representation learned by a speech recognition neural network, and matching statistics of neuron activations between different source and target utterances. Similar to image texture synthesis and neural style transfer, the system works by optimizing a cost function with respect to the input waveform samples. To this end we use a differen...
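The idea of optimizing a cost function with respect to the network input, so that activation statistics match a target, can be illustrated with a toy sketch. Everything here is an assumption for illustration: the "network" is a single fixed random linear layer with a ReLU (not a speech recognition model), the statistic is a Gram matrix of activations as in image texture synthesis, and the optimization runs over a small feature matrix rather than waveform samples. The names `gram`, `loss_and_grad`, and `synthesize` are hypothetical.

```python
import numpy as np

def gram(F):
    """Time-averaged Gram matrix of activations F with shape (channels, time)."""
    return F @ F.T / F.shape[1]

def loss_and_grad(X, W, G_target):
    """Squared Gram-matrix mismatch and its gradient with respect to the input X."""
    Z = W @ X                      # pre-activations of the toy one-layer network
    F = np.maximum(Z, 0.0)         # ReLU activations
    diff = gram(F) - G_target      # symmetric, because both Gram matrices are
    loss = np.sum(diff ** 2)
    dF = 4.0 * diff @ F / F.shape[1]   # backprop through the Gram matrix
    dX = W.T @ (dF * (Z > 0))          # backprop through the ReLU and linear layer
    return loss, dX

def synthesize(W, G_target, shape, steps=100, lr=0.1, seed=0):
    """Gradient descent on the input; the step size is halved after a failed step."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal(shape)
    loss, dX = loss_and_grad(X, W, G_target)
    losses = [loss]
    for _ in range(steps):
        X_new = X - lr * dX
        new_loss, new_dX = loss_and_grad(X_new, W, G_target)
        if new_loss < loss:
            X, loss, dX = X_new, new_loss, new_dX
        else:
            lr *= 0.5              # reject the step and try a smaller one
        losses.append(loss)
    return X, losses

# Usage: match the activation statistics of a random "target utterance".
rng = np.random.default_rng(42)
W = rng.standard_normal((8, 5))
target_input = rng.standard_normal((5, 20))
G_target = gram(np.maximum(W @ target_input, 0.0))
X_syn, losses = synthesize(W, G_target, shape=(5, 20))
```

Because only time-averaged statistics are matched, the synthesized input need not reproduce the target frame by frame, which is what makes this a texture-style objective.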

Voice Conversion Using Sequence-to-Sequence Learning of Context Posterior Probabilities

Voice Conversion

Categories: Dynamic Time Warping | Fourier Transform | Mel-spectrogram | Recurrent Neural Network

Tags: 2017 | Hiroshi Saruwatari | Hiroyuki Miyoshi | Shinnosuke Takamichi | Yuki Saito

Voice conversion (VC) using sequence-to-sequence learning of context posterior probabilities is proposed. Conventional VC using shared context posterior probabilities predicts target speech parameters from the context posterior probabilities estimated from the source speech parameters. Although conventional VC can be built from non-parallel data, it is difficult to convert speaker individuality, such as the phonetic properties and speaking rate contained in the posterior probabilities, because the source posterior probabilities are directly used for predicting target speech parameters. In this work, w...

Robustness of Voice Conversion Techniques Under Mismatched Conditions

Voice Conversion

Categories: Fourier Transform | Gaussian Mixture Model | Mel-spectrogram

Tags: 2016 | Dipjyoti Paul | Goutam Saha | Md Sahidullah | Monisankha Pal

Most existing studies on voice conversion (VC) are conducted in acoustically matched conditions between the source and target signals. However, the robustness of VC methods in the presence of mismatch remains unknown. In this paper, we report a comparative analysis of different VC techniques under mismatched conditions. Extensive experiments with five different VC techniques on the CMU ARCTIC corpus suggest that the performance of VC methods degrades substantially in noisy conditions. We have found that bilinear frequency warping with amplitude scaling (BLFWAS) outperforms other methods in most of t...

Dictionary Update for NMF-based Voice Conversion Using an Encoder-Decoder Network

Voice Conversion

Categories: Autoencoder | Dynamic Time Warping | Fourier Transform | Mel-spectrogram

In this paper, we propose a dictionary update method for Nonnegative Matrix Factorization (NMF) with high-dimensional data in a spectral conversion (SC) task. Voice conversion has been widely studied due to its potential applications such as personalized speech synthesis and speech enhancement. Exemplar-based NMF (ENMF) has emerged as an effective and probably the simplest choice among all techniques for SC, as long as a source-target parallel speech corpus is given. ENMF-based SC systems usually need a large number of bases (exemplars) to ensure the quality of the converted speech. However, a sma...
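As background for the exemplar-based setup mentioned above, the following is a minimal sketch of how ENMF-style conversion is commonly described: activations are estimated against a fixed source dictionary and then reused with a time-aligned target dictionary. It is not the dictionary update method proposed in the paper. The Euclidean multiplicative update, the function names, and the random toy data are all assumptions for illustration.

```python
import numpy as np

def estimate_activations(S, A_src, n_iter=200, eps=1e-9, seed=0):
    """Nonnegative activations H such that S is approximately A_src @ H.

    A_src stays fixed (exemplar dictionary); only H is updated, using the
    standard multiplicative rule for the Euclidean NMF objective.
    """
    rng = np.random.default_rng(seed)
    H = rng.random((A_src.shape[1], S.shape[1])) + eps
    for _ in range(n_iter):
        H *= (A_src.T @ S) / (A_src.T @ A_src @ H + eps)
    return H

def convert(S, A_src, A_tgt, **kwargs):
    """Reuse the source activations with the paired target dictionary."""
    H = estimate_activations(S, A_src, **kwargs)
    return A_tgt @ H, H

# Usage with toy nonnegative "spectra": 6 frequency bins, 4 exemplars, 5 frames.
rng = np.random.default_rng(0)
A_src = rng.random((6, 4))          # source exemplars (columns)
A_tgt = rng.random((6, 4))          # target exemplars, assumed paired column-wise
S = A_src @ rng.random((4, 5))      # source utterance to convert
Y, H = convert(S, A_src, A_tgt)
```

The sketch makes the cost of large dictionaries visible: each update multiplies by `A_src.T @ A_src`, so the work per iteration grows with the number of exemplars.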

High quality voice conversion using prosodic and high-resolution spectral features

Voice Conversion

Categories: Autoencoder | Deep Learning | Dynamic Time Warping | Fourier Transform

Voice conversion methods have advanced rapidly over the last decade. Studies have shown that speaker characteristics are captured by spectral features as well as various prosodic features. Most existing conversion methods focus on the spectral features, as they directly represent the timbre characteristics, while some conversion methods have focused only on the prosodic feature represented by the fundamental frequency. In this paper, a comprehensive framework using deep neural networks to convert both timbre and prosodic features is proposed. The timbre feature is represented by a high-resolution ...

Reducing one-to-many problem in Voice Conversion by equalizing the formant locations using dynamic frequency warping

Voice Conversion

Categories: Dynamic Time Warping | Fourier Transform | Gaussian Mixture Model | Mel-spectrogram

Tags: 2015 | Seyed Hamidreza Mohammadi

In this study, we investigate a solution to reduce the effect of the one-to-many problem in voice conversion (VC). The one-to-many problem in VC happens when two very similar speech segments from the source speaker have corresponding speech segments from the target speaker that are not similar to each other. As a result, the mapping function usually over-smooths the generated features in order to be similar to both target speech segments. In this study, we propose to equalize the formant locations of source-target frame pairs using dynamic frequency warping in order to reduce the complexity. After the conversion, anoth...
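Dynamic frequency warping can be pictured as dynamic time warping applied along the frequency axis of a single frame pair. The sketch below is a toy illustration under stated assumptions, not the paper's procedure: the local cost is the absolute difference between spectral envelope values, the step pattern is the basic three-neighbor one, and `dfw_path` and `warp_to_target` are hypothetical names.

```python
import numpy as np

def dfw_path(src, tgt):
    """Minimum-cost monotone alignment between the bins of two spectral envelopes."""
    n, m = len(src), len(tgt)
    cost = np.abs(src[:, None] - tgt[None, :])   # assumed local cost
    acc = np.full((n, m), np.inf)
    acc[0, 0] = cost[0, 0]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            best = np.inf
            if i > 0:
                best = min(best, acc[i - 1, j])
            if j > 0:
                best = min(best, acc[i, j - 1])
            if i > 0 and j > 0:
                best = min(best, acc[i - 1, j - 1])
            acc[i, j] = cost[i, j] + best
    # Backtrack from the last bin pair to the first along cheapest predecessors.
    i, j = n - 1, m - 1
    path = [(i, j)]
    while (i, j) != (0, 0):
        candidates = []
        if i > 0 and j > 0:
            candidates.append((acc[i - 1, j - 1], i - 1, j - 1))
        if i > 0:
            candidates.append((acc[i - 1, j], i - 1, j))
        if j > 0:
            candidates.append((acc[i, j - 1], i, j - 1))
        _, i, j = min(candidates)
        path.append((i, j))
    return path[::-1]

def warp_to_target(src, tgt):
    """Move source bins onto the target frequency axis, averaging many-to-one bins."""
    warped = np.zeros(len(tgt))
    counts = np.zeros(len(tgt))
    for i, j in dfw_path(src, tgt):
        warped[j] += src[i]
        counts[j] += 1
    return warped / counts

# Usage: a single "formant" peak at bin 3 is shifted to the target's peak at bin 6.
src = np.zeros(10); src[3] = 1.0
tgt = np.zeros(10); tgt[6] = 1.0
warped = warp_to_target(src, tgt)
```

After warping, the source peak sits at the target's peak location, so a mapping function trained on such pairs no longer has to learn the formant shift itself.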

Category: Fourier Transform