Abstract

A growing number of zero-shot voice conversion algorithms have been proposed in recent years. Speaker embeddings are a fundamental component of zero-shot voice conversion and are key to improving the speaker similarity of the converted speech. In this paper, we study the impact of speaker embeddings on zero-shot voice conversion performance. To better represent the characteristics of the target speaker and improve speaker similarity, we propose a novel speaker representation method that combines the advantages of D-vector and global style token (GST) based speaker representations. Objective and subjective evaluations show that the proposed method achieves decent performance on zero-shot voice conversion and significantly improves speaker similarity over both D-vector and GST based speaker embeddings.