TY - JOUR
T1 - Speech self-supervised representations benchmarking
T2 - A case for larger probing heads
AU - Zaiem, Salah
AU - Kemiche, Youcef
AU - Parcollet, Titouan
AU - Essid, Slim
AU - Ravanelli, Mirco
N1 - Publisher Copyright:
© 2024
PY - 2025/1/1
Y1 - 2025/1/1
N2 - Self-supervised learning (SSL) leverages large datasets of unlabeled speech to reach impressive performance with reduced amounts of annotated data. The high number of proposed approaches has fostered the emergence of comprehensive benchmarks that evaluate their performance on a set of downstream tasks exploring various aspects of the speech signal. However, while the number of considered tasks has been growing, most proposals rely upon a single downstream architecture that maps the frozen SSL representations to the task labels. This study examines how benchmarking results are affected by changes in the probing head architecture. Interestingly, we find that altering the structure of the downstream architecture leads to significant fluctuations in the performance ranking of the evaluated models. Against common practice in speech SSL benchmarking, we evaluate larger-capacity probing heads, showing their impact on performance, inference costs, generalization, and multi-level feature exploitation.
KW - Representation learning
KW - Self-supervised learning
KW - Speech processing
U2 - 10.1016/j.csl.2024.101695
DO - 10.1016/j.csl.2024.101695
M3 - Article
AN - SCOPUS:85201003433
SN - 0885-2308
VL - 89
JO - Computer Speech & Language
JF - Computer Speech & Language
M1 - 101695
ER -