Mel-frequency filter bank (MFB) based approaches have the advantage of learning speech compared to raw spectrum since MFB has less feature size. However, speech generator with MFB approaches require additional vocoder that needs a huge amount of computation expense for training process. The additional pre/post processing such as MFB and vocoder is not essential to convert real human speech to others. It is possible to only use the raw spectrum along with the phase to generate different style of voices with clear pronunciation. In this regard, we propose a fast and effective approach to convert realistic voices using raw spectrum in a parallel manner. Our transformer-based model architecture which does not have any CNN or RNN layers has shown the advantage of learning fast and solved the limitation of sequential computation of conventional RNN. In this paper, we introduce a vocoder-free end-to-end voice conversion method using transformer network. The presented conversion model can also be used in speaker adaptation for speech recognition. Our approach can convert the source voice to a target voice without using MFB and vocoder. We can get an adapted MFB for speech recognition by multiplying the converted magnitude with phase. We perform our voice conversion experiments on TIDIGITS dataset using the metrics such as naturalness, similarity, and clarity with mean opinion score, respectively.
Conclusion
We proposed a voice transform with self-attention mechanism in a raw spectrum level, while conventional methods use a vocoder in MFB level. MFB-based approaches had the advantage of computational learning convenience compared to raw spectrum. However, speech generator with MFB approaches require vocoder that needs a huge amount of computation expense for training process. With vocoder, it is possible to get better quality of the voice with the synthesis. On the contrary, the problem of complexity due to the extra computation is inevitable. The additional pre/post processing such as MFB and vocoder is not essential to convert real human speech to others. In this paper, we proposed a vocoder-free end-to-end voice conversion method using a fast and efficient transformer network that can convert spectrum in parallel manner. Obtaining the conversion results with raw spectrum without the help of repetitive vocoder had the advantage of using an original phase information to provide the result. We gathered 38 participants and conducted MOS evaluation on the naturalness, similarity and clarity of the converted speech. In the overall speaker average MOS, the scores of our experiment results got 3:40 in naturalness, 3:82 in similarity, and 3:93 in clarity, respectively. Our results showed that the proposed method is possible to transform with good clarity while it is maintaining appropriateness of naturalness and similarity.