TY - CHAP
T1 - Adding Temporal Musical Controls on Top of Pretrained Generative Models
AU - Nabi, Sarah
AU - Demerlé, Nils
AU - Peeters, Geoffroy
AU - Bevilacqua, Frédéric
AU - Esling, Philippe
N1 - Publisher Copyright: © S. Nabi, N. Demerlé, G. Peeters, F. Bevilacqua and P. Esling.
PY - 2025/1/1
Y1 - 2025/1/1
N2 - Recent advances in deep generative modeling have enabled high-quality models for musical audio synthesis. However, these approaches remain difficult to control, are confined to simple, static attributes and, most importantly, entail retraining a different computationally heavy architecture for each new control. This is inefficient and impractical, as it requires substantial computational resources. In this paper, we propose a novel approach that adds time-varying musical controls on top of any pretrained generative model with an exposed latent space (e.g. neural audio codecs), without retraining or finetuning. Our method supports both discrete and continuous attributes by adapting a rectified flow approach with a latent diffusion transformer. We learn an invertible mapping between pretrained latent variables and a new space that disentangles explicit control attributes from style variables capturing the remaining factors of variation. This enables both feature extraction from an input and editing of those features to generate transformed audio samples. Finally, it also introduces the ability to perform synthesis directly from the audio descriptors. We validate our method on four datasets ranging from different musical instruments to full music recordings, on which we outperform state-of-the-art task-specific baselines in terms of both generation quality and accuracy of the control by inferring transferred attributes. Our code is available on the supporting webpage.
AB - Recent advances in deep generative modeling have enabled high-quality models for musical audio synthesis. However, these approaches remain difficult to control, are confined to simple, static attributes and, most importantly, entail retraining a different computationally heavy architecture for each new control. This is inefficient and impractical, as it requires substantial computational resources. In this paper, we propose a novel approach that adds time-varying musical controls on top of any pretrained generative model with an exposed latent space (e.g. neural audio codecs), without retraining or finetuning. Our method supports both discrete and continuous attributes by adapting a rectified flow approach with a latent diffusion transformer. We learn an invertible mapping between pretrained latent variables and a new space that disentangles explicit control attributes from style variables capturing the remaining factors of variation. This enables both feature extraction from an input and editing of those features to generate transformed audio samples. Finally, it also introduces the ability to perform synthesis directly from the audio descriptors. We validate our method on four datasets ranging from different musical instruments to full music recordings, on which we outperform state-of-the-art task-specific baselines in terms of both generation quality and accuracy of the control by inferring transferred attributes. Our code is available on the supporting webpage.
UR - https://www.scopus.com/pages/publications/105025363481
U2 - 10.5281/zenodo.17706592
DO - 10.5281/zenodo.17706592
M3 - Chapter
AN - SCOPUS:105025363481
T3 - Proceedings of the International Society for Music Information Retrieval Conference
SP - 793
EP - 800
BT - Proceedings of the International Society for Music Information Retrieval Conference
PB - International Society for Music Information Retrieval
ER -