With recent advances in deep learning, the performance of voice conversion (VC) in terms of quality and similarity has improved significantly. However, deep-learning-based VC systems generally require heavy computation, which can cause notable latency and thus limit their deployment in real-world applications. Increasing online computation efficiency has therefore become an important task. In this study, we propose a novel mixture-of-experts (MoE) based VC system. The MoE model uses a gating mechanism to assign optimal weights to feature maps, which improves VC performance. In addition, imposing sparsity constraints on the gating mechanism accelerates online computation: feature maps whose gate weights are zeroed out are redundant, so the corresponding convolutions can be skipped. Experimental results show that, with suitable sparsity constraints, we can reduce FLOPs (floating-point operations) by a notable 70% while improving VC performance in both objective evaluations and human listening tests.
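The gating idea described above can be illustrated with a minimal sketch. It is not the MoEVC implementation: the function name `gated_conv1d`, the tensor shapes, and the choice to gate output channels with non-negative weights (as a ReLU gate would produce) are assumptions made for illustration only. The sketch shows how a zero gate weight lets the convolution for that feature map be skipped, and how the saved multiply-accumulate operations (MACs) can be counted.

```python
import numpy as np

def gated_conv1d(x, kernels, gates):
    """Illustrative gated 1-D convolution (hypothetical helper, not from the paper).

    x:       input feature maps, shape (C_in, T)
    kernels: convolution kernels, shape (C_out, C_in, K)
    gates:   non-negative gate weights, shape (C_out,); zeros mark skipped maps
    Returns the output of shape (C_out, T - K + 1) and the number of
    multiply-accumulate operations actually performed.
    """
    c_out, c_in, k = kernels.shape
    t_out = x.shape[1] - k + 1
    y = np.zeros((c_out, t_out))
    macs = 0
    for o in range(c_out):
        if gates[o] == 0.0:
            continue  # gate is zero: the output map stays zero, no convolution needed
        for i in range(c_in):
            # 'valid' correlation is the sliding dot product used by conv layers
            y[o] += np.correlate(x[i], kernels[o, i], mode="valid")
        y[o] *= gates[o]  # re-weight the surviving feature map
        macs += c_in * k * t_out
    return y, macs

# Toy comparison: all 8 gates active versus only 2 of 8 active.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 16))
kernels = rng.standard_normal((8, 4, 3))
dense_gates = np.ones(8)
sparse_gates = np.array([0.5, 0.0, 0.0, 1.2, 0.0, 0.0, 0.0, 0.0])

_, dense_macs = gated_conv1d(x, kernels, dense_gates)
_, sparse_macs = gated_conv1d(x, kernels, sparse_gates)
print(f"MAC reduction: {1 - sparse_macs / dense_macs:.0%}")
```

In this toy setting, keeping 2 of 8 output maps removes three quarters of the MACs for the layer. The reduction reported in the study depends on the learned gates and the full network, which this sketch does not model.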
Conclusion
The main contribution of this study is twofold. First, we confirmed the effectiveness of introducing the DeepMoEs model to accelerate online computation for the VC task. Our experimental results show that the proposed MoEVC system can reduce FLOPs by more than 70% without degrading, and in fact while improving, the naturalness and similarity of the converted speech. Second, we showed that MOSNet can be used as an effective learning-based objective evaluator for the VC task. Because conducting extensive human listening tests was prohibitively costly, we used MOSNet to predict MOS scores, and we further confirmed that the predicted scores are consistent with the results of human listening tests. We hope that the findings of this study will promote research on model compression and online computation acceleration for VC. In the future, we will test the compatibility of MoEVC with advanced vocoder systems and learning algorithms.