High-Quality Text-to-Speech Implementation via Active Shallow Diffusion Mechanism.

Authors: Junlin Deng, Yan Deng, Ruihan Hou, Yongqiu Long, Ning Wu

Language: eng

Classification number:

Publication info: Switzerland: Sensors (Basel, Switzerland), 2025

Physical description:

Collection: NCBI

ID: 79870

Denoising diffusion probabilistic models (DDPMs) have proven useful in text-to-speech (TTS) tasks; however, traditional diffusion models struggle with real-time processing because iterative sampling requires hundreds of steps. This work proposes the Cascaded MixGAN-TTS (CMG-TTS), a two-stage, diffusion-based acoustic model for TTS with fast and efficient inference, to address this problem. An active shallow diffusion mechanism divides CMG-TTS training into two stages. In the first stage, a basic acoustic model is trained to provide useful prior knowledge for the second stage; for this underlying acoustic modeling, a linguistic encoder based on a mixture combination mechanism works together with pitch and energy predictors. In the subsequent stage, a post-net is used to improve mel-spectrogram reconstruction. CMG-TTS is evaluated on datasets including AISHELL3 and LJSpeech, and the experiments show that it achieves satisfactory results on both subjective and objective evaluation metrics with only one denoising step. Compared with other diffusion-based TTS models, CMG-TTS obtains a leading real-time factor (RTF), and ablation studies show that both stages are effective.
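
For readers unfamiliar with two of the terms in the abstract, the following is a minimal Python sketch, not the authors' implementation. It assumes the usual definition of the real-time factor (synthesis time divided by the duration of the audio produced) and the standard DDPM forward-noising formula, which shallow diffusion is commonly described as applying to a coarse first-stage mel-spectrogram before a small number of denoising steps. All function names, the linear noise schedule, and its default values are illustrative assumptions; the abstract does not specify them.

```python
import math
import random


def real_time_factor(synthesis_seconds, num_samples, sample_rate):
    """RTF = time spent synthesizing / duration of the audio produced."""
    if num_samples <= 0 or sample_rate <= 0:
        raise ValueError("num_samples and sample_rate must be positive")
    return synthesis_seconds / (num_samples / sample_rate)


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Assumed linear noise schedule; the abstract does not specify one."""
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def alpha_bar(betas, k):
    """Cumulative product of (1 - beta_t) over the first k steps."""
    product = 1.0
    for beta in betas[:k]:
        product *= 1.0 - beta
    return product


def diffuse_to_step(coarse_mel, betas, k, rng):
    """Standard DDPM forward noising of a coarse mel-spectrogram to step k.

    x_k = sqrt(alpha_bar_k) * x_0 + sqrt(1 - alpha_bar_k) * noise
    """
    ab = alpha_bar(betas, k)
    signal_scale = math.sqrt(ab)
    noise_scale = math.sqrt(1.0 - ab)
    return [
        [signal_scale * v + noise_scale * rng.gauss(0.0, 1.0) for v in frame]
        for frame in coarse_mel
    ]


if __name__ == "__main__":
    # Hypothetical numbers: 0.5 s of compute for 2 s of 22.05 kHz audio.
    print("RTF:", real_time_factor(0.5, 44100, 22050))

    # Toy 2-frame, 2-bin "mel-spectrogram" standing in for a first-stage output.
    coarse = [[0.1, -0.2], [0.3, 0.4]]
    betas = linear_beta_schedule(4)
    print("Noised to step 1:", diffuse_to_step(coarse, betas, 1, random.Random(0)))
```

An RTF below 1 means synthesis runs faster than playback. In the sketch, a small `k` keeps most of the coarse prediction intact, which is why a reverse process starting from that point needs far fewer steps than one starting from pure noise.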