AnCoGen: Analysis, Control and Generation of Speech with a Masked Autoencoder

  • Samir Sadok
  • Simon Leglaive
  • Laurent Girin
  • Gaël Richard
  • Xavier Alameda-Pineda

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

This article introduces AnCoGen, a novel method that leverages a masked autoencoder to unify the analysis, control, and generation of speech signals within a single model. AnCoGen can analyze speech by estimating key attributes, such as speaker identity, pitch, content, loudness, signal-to-noise ratio, and clarity index. In addition, it can generate speech from these attributes and allow precise control of the synthesized speech by modifying them. Extensive experiments demonstrated the effectiveness of AnCoGen across speech analysis-resynthesis, pitch estimation, pitch modification, and speech enhancement. Code and audio examples are available online.

Original language: English
Title of host publication: 2025 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2025 - Proceedings
Editors: Bhaskar D Rao, Isabel Trancoso, Gaurav Sharma, Neelesh B. Mehta
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9798350368741
Publication status: Published - 1 Jan 2025
Event: 2025 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2025 - Hyderabad, India
Duration: 6 Apr 2025 to 11 Apr 2025

Publication series

Name: ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
ISSN (Print): 1520-6149

Conference

Conference: 2025 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2025
Country/Territory: India
City: Hyderabad
Period: 6/04/25 to 11/04/25

Keywords

  • Speech analysis/transformation/synthesis
  • masked autoencoder
  • pitch estimation and modification
  • speech enhancement
