TY - JOUR
T1 - Singer Identity Representation Learning Using Self-Supervised Techniques
AU - Torres, Bernardo
AU - Lattner, Stefan
AU - Richard, Gaël
N1 - Publisher Copyright:
© B. Torres, S. Lattner, and G. Richard.
PY - 2023/1/1
Y1 - 2023/1/1
N2 - Significant strides have been made in creating voice identity representations using speech data. However, the same level of progress has not been achieved for singing voices. To bridge this gap, we suggest a framework for training singer identity encoders to extract representations suitable for various singing-related tasks, such as singing voice similarity and synthesis. We explore different self-supervised learning techniques on a large collection of isolated vocal tracks and apply data augmentations during training to ensure that the representations are invariant to pitch and content variations. We evaluate the quality of the resulting representations on singer similarity and identification tasks across multiple datasets, with a particular emphasis on out-of-domain generalization. Our proposed framework produces high-quality embeddings that outperform both speaker verification and wav2vec 2.0 pre-trained baselines on singing voice while operating at 44.1 kHz. We release our code and trained models to facilitate further research on singing voice and related areas.
AB - Significant strides have been made in creating voice identity representations using speech data. However, the same level of progress has not been achieved for singing voices. To bridge this gap, we suggest a framework for training singer identity encoders to extract representations suitable for various singing-related tasks, such as singing voice similarity and synthesis. We explore different self-supervised learning techniques on a large collection of isolated vocal tracks and apply data augmentations during training to ensure that the representations are invariant to pitch and content variations. We evaluate the quality of the resulting representations on singer similarity and identification tasks across multiple datasets, with a particular emphasis on out-of-domain generalization. Our proposed framework produces high-quality embeddings that outperform both speaker verification and wav2vec 2.0 pre-trained baselines on singing voice while operating at 44.1 kHz. We release our code and trained models to facilitate further research on singing voice and related areas.
UR - https://www.scopus.com/pages/publications/85219530173
M3 - Article
AN - SCOPUS:85219530173
SN - 3006-3094
VL - 2023
SP - 448
EP - 456
JO - Proceedings of the International Society for Music Information Retrieval Conference
JF - Proceedings of the International Society for Music Information Retrieval Conference
ER -