Zero-shot speaker adaptation aims to clone the voice of a previously unseen speaker from only a few seconds of reference speech. Nevertheless, existing zero-shot multi-speaker text-to-speech (TTS) systems still exhibit a marked gap in synthesized speech quality and speaker similarity between unseen and seen speakers. To narrow this gap, this study introduces DiffGAN-ZSTTS, an efficient zero-shot speaker-adaptive TTS model. The model is built on the FastSpeech2 framework and employs a diffusion-based decoder to improve generalization to unseen speakers in zero-shot settings. We present the SE-Res2FFT module, which refines the encoder's feed-forward Transformer (FFT) block by placing SE-Res2Net modules in parallel with the multi-head self-attention mechanism, thereby balancing the extraction of local and global features. We further introduce the MHSE module, which uses multi-head attention to strengthen the representation of speaker reference audio features. The model is trained and evaluated on the AISHELL3 and LibriTTS datasets, covering both seen and unseen speaker conditions in Chinese and English. Experimental results show that DiffGAN-ZSTTS substantially improves both synthesized speech quality and speaker similarity. We also evaluate the model on the out-of-domain Baker and VCTK datasets; the results show that it performs zero-shot synthesis for unseen speakers from only a few seconds of speech and outperforms state-of-the-art models in both speaker similarity and audio quality.
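To make the parallel local/global design of the SE-Res2FFT block concrete, the sketch below shows one way such a block could be assembled in PyTorch: an SE-Res2Net-style convolutional branch (multi-scale local features with squeeze-and-excitation reweighting) runs alongside multi-head self-attention (global dependencies), and the two branches are fused before the position-wise convolutional feed-forward layer. This is a minimal illustrative sketch, not the paper's released code; the class names, additive fusion, and all hyperparameters (model dimension, number of heads, kernel sizes) are assumptions made for the example.

```python
# Illustrative sketch only: names (SERes2Block, SERes2FFTBlock) and all
# hyperparameters are assumptions, not the paper's implementation.
import torch
import torch.nn as nn


class SELayer(nn.Module):
    """Squeeze-and-excitation channel reweighting."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):            # x: (B, C, T)
        w = self.fc(x.mean(dim=-1))  # squeeze over time -> (B, C)
        return x * w.unsqueeze(-1)   # excite: rescale each channel


class SERes2Block(nn.Module):
    """Res2Net-style hierarchical 1-D convolutions followed by SE."""
    def __init__(self, channels, scale=4, kernel_size=3):
        super().__init__()
        assert channels % scale == 0
        self.scale = scale
        width = channels // scale
        self.convs = nn.ModuleList(
            nn.Conv1d(width, width, kernel_size, padding=kernel_size // 2)
            for _ in range(scale - 1)
        )
        self.se = SELayer(channels)

    def forward(self, x):                        # x: (B, C, T)
        splits = torch.chunk(x, self.scale, dim=1)
        out, prev = [splits[0]], None
        for i, conv in enumerate(self.convs):    # multi-scale local features
            prev = conv(splits[i + 1] if prev is None else splits[i + 1] + prev)
            out.append(prev)
        return self.se(torch.cat(out, dim=1)) + x   # residual connection


class SERes2FFTBlock(nn.Module):
    """FFT-style encoder block: SE-Res2Net branch in parallel with
    multi-head self-attention, fused and passed to a position-wise
    convolutional feed-forward layer."""
    def __init__(self, d_model=256, n_heads=4, d_ff=1024):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.local = SERes2Block(d_model)
        self.norm1 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(                 # position-wise feed-forward
            nn.Conv1d(d_model, d_ff, 9, padding=4),
            nn.ReLU(),
            nn.Conv1d(d_ff, d_model, 1),
        )
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x, key_padding_mask=None):     # x: (B, T, d_model)
        g, _ = self.attn(x, x, x, key_padding_mask=key_padding_mask)  # global branch
        l = self.local(x.transpose(1, 2)).transpose(1, 2)             # local branch
        x = self.norm1(x + g + l)                 # fuse the two branches
        y = self.ff(x.transpose(1, 2)).transpose(1, 2)
        return self.norm2(x + y)


if __name__ == "__main__":
    block = SERes2FFTBlock()
    phones = torch.randn(2, 37, 256)             # (batch, phoneme frames, dim)
    print(block(phones).shape)                    # torch.Size([2, 37, 256])
```

Under these assumptions, the convolutional Res2Net branch supplies multi-scale local detail while self-attention captures long-range dependencies, which is one plausible realization of the "balanced extraction of local and global features" described above.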