End-to-end models for raw audio generation are challenging to build, especially if they have to work with non-parallel data, which is a desirable setup in many situations. Voice conversion, in which a model has to impersonate a speaker in a recording, is one of those situations. In this paper, we propose Blow, a single-scale normalizing flow that uses hypernetwork conditioning to perform many-to-many voice conversion directly on raw audio. Blow is trained end-to-end, with non-parallel data, on a frame-by-frame basis using a single speaker identifier. We show that Blow compares favorably to existing flow-based architectures and other competitive baselines, obtaining equal or better performance in both objective and subjective evaluations. We further assess the impact of its main components with an ablation study, and quantify a number of properties such as the necessary amount of training data or the preference for source or target speakers.
Conclusion
In this work we put forward the potential of flow-based generative models for raw audio synthesis, and especially for the challenging task of non-parallel voice conversion. We propose Blow, a single-scale hyperconditioned flow that features a many-block structure with shared embeddings and performs conversion in a forward-backward manner. Because Blow departs from existing flow-based generative models in these aspects, it is able to outperform them and compete with, or even improve upon, existing non-parallel voice conversion systems. We also quantify the impact of the proposed improvements and assess the effect that the amount of training data and the selection of source/target speaker can have on the final result. As future work, we want to improve the model to see if we can deal with other tasks such as speech enhancement or instrument conversion, perhaps by further enhancing the hyperconditioning mechanism or, simply, by tuning its structure or hyperparameters.
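To illustrate what forward-backward conversion with a conditioned flow means, the following is a minimal sketch, not the Blow architecture. It assumes a single element-wise affine transform whose log-scale and shift are produced from a speaker embedding by a linear map standing in for a hypernetwork. All names (`embeddings`, `W_scale`, `W_shift`, `hyper`, `convert`) and all sizes are hypothetical, and the parameters are random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
N_SPEAKERS, EMB_DIM, FRAME_LEN = 3, 4, 8

# Stand-ins for learned parameters (random here, trained in a real model).
embeddings = rng.normal(size=(N_SPEAKERS, EMB_DIM))      # one embedding per speaker
W_scale = 0.1 * rng.normal(size=(EMB_DIM, FRAME_LEN))    # toy "hypernetwork" weights
W_shift = 0.1 * rng.normal(size=(EMB_DIM, FRAME_LEN))


def hyper(speaker):
    """Map a speaker identifier to the parameters of the affine transform."""
    e = embeddings[speaker]
    return e @ W_scale, e @ W_shift  # (log-scale, shift), each of length FRAME_LEN


def forward(x, speaker):
    """Audio frame -> latent, conditioned on the given speaker."""
    log_s, t = hyper(speaker)
    return x * np.exp(log_s) + t


def inverse(z, speaker):
    """Latent -> audio frame, conditioned on the given speaker."""
    log_s, t = hyper(speaker)
    return (z - t) * np.exp(-log_s)


def convert(x, source, target):
    """Forward pass with the source speaker, backward pass with the target."""
    return inverse(forward(x, source), target)


frame = rng.normal(size=FRAME_LEN)          # a toy "raw audio" frame
converted = convert(frame, source=0, target=1)
```

Because the transform is invertible, converting a frame to its own speaker returns the frame unchanged, and converting to another speaker and back recovers the original. In this toy setting the two passes differ only in the speaker-dependent parameters, which is the property the sketch is meant to show.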