TY - GEN
T1 - Adapting Pitch-Based Self Supervised Learning Models for Tempo Estimation
AU - Gagneré, Antonin
AU - Essid, Slim
AU - Peeters, Geoffroy
N1 - Publisher Copyright: © 2024 IEEE.
PY - 2024/1/1
Y1 - 2024/1/1
N2 - Tempo estimation is the task of estimating the periodicity of the dominant rhythm pulse of a music audio signal. It is therefore closely related to dominant pitch estimation. Recently, both tasks have been addressed in a Self-Supervised Learning (SSL) fashion so as to leverage unlabelled data for training. In this work, we study the applicability of two successful pitch-based SSL models, SPICE and PESTO, to tempo estimation. Both successfully exploit Siamese networks with a pitch-shifting view generation between the two branches. To apply these models to tempo estimation, we represent the audio signal by the Constant-Q transform (CQT) of its onset strength function and adapt their view generation using time-stretching (instead of pitch-shifting), which is efficiently implemented by shifting the CQT. In a large experiment, we show that simply adapting PESTO in this way yields better results than the previous SSL approach to tempo estimation for most datasets used in the reference benchmark. Further, since PESTO is lightweight and requires only a small amount of training data, we study a new learning scheme where the downstream datasets are processed directly in an SSL fashion (without access to labels), and show that this is an interesting alternative that further improves performance on some datasets.
AB - Tempo estimation is the task of estimating the periodicity of the dominant rhythm pulse of a music audio signal. It is therefore closely related to dominant pitch estimation. Recently, both tasks have been addressed in a Self-Supervised Learning (SSL) fashion so as to leverage unlabelled data for training. In this work, we study the applicability of two successful pitch-based SSL models, SPICE and PESTO, to tempo estimation. Both successfully exploit Siamese networks with a pitch-shifting view generation between the two branches. To apply these models to tempo estimation, we represent the audio signal by the Constant-Q transform (CQT) of its onset strength function and adapt their view generation using time-stretching (instead of pitch-shifting), which is efficiently implemented by shifting the CQT. In a large experiment, we show that simply adapting PESTO in this way yields better results than the previous SSL approach to tempo estimation for most datasets used in the reference benchmark. Further, since PESTO is lightweight and requires only a small amount of training data, we study a new learning scheme where the downstream datasets are processed directly in an SSL fashion (without access to labels), and show that this is an interesting alternative that further improves performance on some datasets.
KW - self-supervised-learning
KW - tempo estimation
UR - https://www.scopus.com/pages/publications/85195362625
U2 - 10.1109/ICASSP48485.2024.10447129
DO - 10.1109/ICASSP48485.2024.10447129
M3 - Conference contribution
AN - SCOPUS:85195362625
T3 - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
SP - 956
EP - 960
BT - 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024 - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024
Y2 - 14 April 2024 through 19 April 2024
ER -