TY - GEN
T1 - A Lightweight Dual-Stage Framework for Personalized Speech Enhancement Based on DeepFilterNet2
AU - Serre, Thomas
AU - Fontaine, Mathieu
AU - Benhaim, Éric
AU - Dutour, Geoffroy
AU - Essid, Slim
N1 - Publisher Copyright: © 2024 IEEE.
PY - 2024/1/1
Y1 - 2024/1/1
N2 - Isolating the desired speaker's voice amidst multiple speakers in a noisy acoustic context is a challenging task. Personalized speech enhancement (PSE) endeavours to achieve this by leveraging prior knowledge of the speaker's voice. Recent research efforts have yielded promising PSE models, albeit often accompanied by computationally intensive architectures, unsuitable for resource-constrained embedded devices. In this paper, we introduce a novel method to personalize a lightweight dual-stage Speech Enhancement (SE) model and implement it within DeepFilterNet2, an SE model renowned for its state-of-the-art performance. We seek an optimal integration of speaker information within the model, exploring different positions for the integration of the speaker embeddings within the dual-stage enhancement architecture. We also investigate a tailored training strategy when adapting DeepFilterNet2 to a PSE task. We show that our personalization method greatly improves the performance of DeepFilterNet2 while preserving minimal computational overhead.
AB - Isolating the desired speaker's voice amidst multiple speakers in a noisy acoustic context is a challenging task. Personalized speech enhancement (PSE) endeavours to achieve this by leveraging prior knowledge of the speaker's voice. Recent research efforts have yielded promising PSE models, albeit often accompanied by computationally intensive architectures, unsuitable for resource-constrained embedded devices. In this paper, we introduce a novel method to personalize a lightweight dual-stage Speech Enhancement (SE) model and implement it within DeepFilterNet2, an SE model renowned for its state-of-the-art performance. We seek an optimal integration of speaker information within the model, exploring different positions for the integration of the speaker embeddings within the dual-stage enhancement architecture. We also investigate a tailored training strategy when adapting DeepFilterNet2 to a PSE task. We show that our personalization method greatly improves the performance of DeepFilterNet2 while preserving minimal computational overhead.
KW - Target speech extraction
KW - real-time
KW - speech enhancement
UR - https://www.scopus.com/pages/publications/85202433557
U2 - 10.1109/ICASSPW62465.2024.10627424
DO - 10.1109/ICASSPW62465.2024.10627424
M3 - Conference contribution
AN - SCOPUS:85202433557
T3 - 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops, ICASSPW 2024 - Proceedings
SP - 780
EP - 784
BT - 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops, ICASSPW 2024 - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops, ICASSPW 2024
Y2 - 14 April 2024 through 19 April 2024
ER -