V2S attack: building DNN-based voice conversion from automatic speaker verification

This paper presents a new voice impersonation attack using voice conversion (VC). Enrolling personal voices for automatic speaker verification (ASV) offers natural and flexible biometric authentication. In principle, ASV systems do not contain the users' voice data. However, if an ASV system is unexpectedly exposed and compromised by a malicious attacker, there is a risk that the attacker will use VC techniques to reproduce the enrolled users' voices. We name this the "verification-to-synthesis (V2S) attack" and propose VC training that uses the ASV model and a pre-trained automatic speech recognition (ASR) model, without the targeted speaker's voice data. The VC model reproduces the targeted speaker's individuality by deceiving the ASV model and restores the phonetic properties of an input voice by matching phonetic posteriorgrams predicted by the ASR model. The experimental evaluation compares voices converted by the proposed method, which does not use the targeted speaker's voice data, with those converted by standard VC, which does. The experimental results demonstrate that the proposed method performs comparably to existing VC methods trained using a very small amount of parallel voice data.
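The two training signals described above can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the use of a negative log-likelihood for the ASV term, the cross-entropy for the phonetic posteriorgram (PPG) term, and the `weight` parameter are all assumptions made for illustration, and the ASV and ASR outputs are passed in as plain numbers rather than computed by real models.

```python
import math

EPS = 1e-12  # guards against log(0); an illustrative choice


def asv_deception_loss(p_target):
    """Assumed ASV term: negative log of the probability that the ASV model
    assigns the converted voice to the targeted speaker. Lower means the
    ASV model is more thoroughly deceived."""
    return -math.log(max(p_target, EPS))


def ppg_matching_loss(ppg_input, ppg_converted):
    """Assumed phonetic term: frame-averaged cross-entropy between the PPGs
    that the ASR model predicts for the input and the converted voice.
    Each PPG is a list of frames; each frame is a list of phoneme
    posterior probabilities."""
    total = 0.0
    for ref_frame, conv_frame in zip(ppg_input, ppg_converted):
        total -= sum(r * math.log(max(c, EPS))
                     for r, c in zip(ref_frame, conv_frame))
    return total / len(ppg_input)


def v2s_loss(p_target, ppg_input, ppg_converted, weight=1.0):
    """Hypothetical combined objective: deceive the ASV model while keeping
    the phonetic content of the input voice."""
    return (asv_deception_loss(p_target)
            + weight * ppg_matching_loss(ppg_input, ppg_converted))
```

In an actual system, `p_target` and the PPGs would come from differentiable ASV and ASR networks, and the combined loss would be minimized with respect to the VC model's parameters while both networks stay fixed. No recordings of the targeted speaker enter the computation; the ASV model is the only source of speaker information.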

Conclusion

This paper presented a new voice impersonation attack using voice conversion (VC), named the verification-to-synthesis (V2S) attack. The VC model was trained to deceive a white-box automatic speaker verification (ASV) model in order to reproduce the targeted speaker’s individuality, and to restore the phonetic properties of the input voice by using a pre-trained automatic speech recognition (ASR) model. The experimental results indicated that the proposed V2S attack can synthesize voices whose naturalness and speaker individuality are comparable to those of an existing parallel VC method trained with a very small amount of data. In future work, we will evaluate a V2S attack that uses pre-stored speakers’ voice data and investigate how our method depends on the input speaker.
