TY - GEN
T1 - VQCPC-GAN
T2 - 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, WASPAA 2021
AU - Nistal, Javier
AU - Aouameur, Cyran
AU - Lattner, Stefan
AU - Richard, Gael
N1 - Publisher Copyright: © 2021 IEEE.
PY - 2021/1/1
Y1 - 2021/1/1
N2 - Influenced by the field of Computer Vision, Generative Adversarial Networks (GANs) are often adopted for the audio domain using fixed-size two-dimensional spectrogram representations as the 'image data'. However, in the (musical) audio domain, it is often desired to generate output of variable duration. This paper presents VQCPC-GAN, an adversarial framework for synthesizing variable-length audio by exploiting Vector-Quantized Contrastive Predictive Coding (VQCPC). A sequence of VQCPC tokens extracted from real audio data serves as conditional input to a GAN architecture, providing step-wise time-dependent features of the generated content. The input noise z (characteristic in adversarial architectures) remains fixed over time, ensuring temporal consistency of global features. We evaluate the proposed model by comparing a diverse set of metrics against various strong baselines. Results show that, even though the baselines score best, VQCPC-GAN achieves comparable performance even when generating variable-length audio. Numerous sound examples are provided in the accompanying website (sonycslparis.github.io/vqcpc-gan.io), and we release the code for reproducibility (github.com/SonyCSLParis/vqcpc-gan).
KW - Audio Synthesis
KW - Generative Adversarial Networks
KW - Vector-Quantized Contrastive Predictive Coding
UR - https://www.scopus.com/pages/publications/85113403037
U2 - 10.1109/WASPAA52581.2021.9632757
DO - 10.1109/WASPAA52581.2021.9632757
M3 - Conference contribution
AN - SCOPUS:85113403037
T3 - IEEE Workshop on Applications of Signal Processing to Audio and Acoustics
SP - 116
EP - 120
BT - 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, WASPAA 2021
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 17 October 2021 through 20 October 2021
ER -