Abstract
Target speaker extraction (TSE) aims to isolate a desired voice from interfering voices and/or background noise, using a reference excerpt acquired during an enrollment phase. While useful in many applications, existing TSE systems cannot handle scenarios in which several voices need to be enrolled and/or targeted. In this work, we address this new task, called multi-target speaker extraction (MTSE), which consists of extracting multiple target speakers from a mixture, possibly containing other interfering voices, using multiple speaker embeddings. Such models can thus serve multiple users without re-enrollment. We propose a curriculum learning scheme to adapt well-known TSE models to the MTSE task. We demonstrate its effectiveness and obtain, for the first time, successful MTSE results on meeting-type excerpts. We also present single-speaker TSE results with multiple enrolled speakers, demonstrating the robustness and versatility of our solution.
| Original language | English |
|---|---|
| Pages (from - to) | 2970-2974 |
| Number of pages | 5 |
| Journal | Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH |
| DOIs | |
| Status | Published - 1 Jan 2025 |
| Event | 26th Interspeech Conference 2025 - Rotterdam, Netherlands. Duration: 17 Aug 2025 → 21 Aug 2025 |
| Title | MTSE: Multi-Target Speaker Extraction for Conversation Scenarios |