Abstract
Target Speaker Extraction (TSE) aims to capture a desired voice among other interfering ones and/or background noise using a reference excerpt acquired during the enrollment phase. While useful in many applications, existing TSE systems cannot handle the scenario where several voices need to be enrolled and/or targeted. In this work, we address this new task, called multi-target speaker extraction (MTSE), which consists of extracting multiple target speakers from a mixture, possibly involving other interfering voices, using multiple speaker embeddings. Such models can thus be used by multiple users without the need for re-enrollment. We propose a curriculum learning scheme to adapt well-known TSE models to the MTSE task. We demonstrate its effectiveness and obtain, for the first time, successful MTSE results on meeting-type excerpts. We also present single-speaker TSE results with multiple enrolled speakers, demonstrating the robustness and versatility of our solution.
| Original language | English |
|---|---|
| Pages (from-to) | 2970-2974 |
| Number of pages | 5 |
| Journal | Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH |
| Publication status | Published - 1 Jan 2025 |
| Event | 26th Interspeech Conference 2025, Rotterdam, Netherlands (17 Aug 2025 → 21 Aug 2025) |
Keywords
- Personalized Speech Enhancement
- Target Speaker Extraction