MTSE: Multi-Target Speaker Extraction for Conversation Scenarios

  • Thomas Serre
  • Mathieu Fontaine
  • Eric Benhaim
  • Slim Essid

Research output: Contribution to journal, Conference article, peer-review

Abstract

Target Speaker Extraction (TSE) aims to isolate a desired voice from interfering voices and/or background noise, using a reference excerpt acquired during an enrollment phase. While useful in many applications, existing TSE systems cannot handle scenarios in which several voices need to be enrolled and/or targeted. In this work, we address this new task, called multi-target speaker extraction (MTSE), which consists of extracting multiple target speakers from a mixture, possibly involving other interfering voices, using multiple speaker embeddings. Such models can thus be used by multiple users without the need for re-enrollment. We propose a curriculum learning scheme to adapt well-known TSE models to the MTSE task. We demonstrate its effectiveness and obtain, for the first time, successful MTSE results on meeting-type excerpts. We also present single-speaker TSE results with multiple enrolled speakers, demonstrating the robustness and versatility of our solution.

Original language: English
Pages (from-to): 2970-2974
Number of pages: 5
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
DOIs
Publication status: Published - 1 Jan 2025
Event: 26th Interspeech Conference 2025 - Rotterdam, Netherlands
Duration: 17 Aug 2025 - 21 Aug 2025

Keywords

  • Personalized Speech Enhancement
  • Target Speaker Extraction
