TY - GEN
T1 - GLA-GRAD
T2 - 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024
AU - Liu, Haocheng
AU - Baoueb, Teysir
AU - Fontaine, Mathieu
AU - Le Roux, Jonathan
AU - Richard, Gaël
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024/1/1
Y1 - 2024/1/1
N2 - Diffusion models are receiving a growing interest for a variety of signal generation tasks such as speech or music synthesis. WaveGrad, for example, is a successful diffusion model that conditionally uses the mel spectrogram to guide a diffusion process for the generation of high-fidelity audio. However, such models face important challenges concerning the noise diffusion process for training and inference, and they have difficulty generating high-quality speech for speakers that were not seen during training. With the aim of minimizing the conditioning error and increasing the efficiency of the noise diffusion process, we propose in this paper a new scheme called GLA-Grad, which consists in introducing a phase recovery algorithm such as the Griffin-Lim algorithm (GLA) at each step of the regular diffusion process. Furthermore, it can be directly applied to an already-trained waveform generation model, without additional training or fine-tuning. We show that our algorithm outperforms state-of-the-art diffusion models for speech generation, especially when generating speech for a previously unseen target speaker.
AB - Diffusion models are receiving a growing interest for a variety of signal generation tasks such as speech or music synthesis. WaveGrad, for example, is a successful diffusion model that conditionally uses the mel spectrogram to guide a diffusion process for the generation of high-fidelity audio. However, such models face important challenges concerning the noise diffusion process for training and inference, and they have difficulty generating high-quality speech for speakers that were not seen during training. With the aim of minimizing the conditioning error and increasing the efficiency of the noise diffusion process, we propose in this paper a new scheme called GLA-Grad, which consists in introducing a phase recovery algorithm such as the Griffin-Lim algorithm (GLA) at each step of the regular diffusion process. Furthermore, it can be directly applied to an already-trained waveform generation model, without additional training or fine-tuning. We show that our algorithm outperforms state-of-the-art diffusion models for speech generation, especially when generating speech for a previously unseen target speaker.
KW - Diffusion models
KW - Griffin-Lim algorithm
KW - domain adaptation
KW - speech generation
UR - https://www.scopus.com/pages/publications/85195366180
U2 - 10.1109/ICASSP48485.2024.10446058
DO - 10.1109/ICASSP48485.2024.10446058
M3 - Conference contribution
AN - SCOPUS:85195366180
T3 - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
SP - 11611
EP - 11615
BT - 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024 - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 14 April 2024 through 19 April 2024
ER -