TY - GEN
T1 - MULTILINGUAL LYRICS-TO-AUDIO ALIGNMENT
AU - Vaglio, Andrea
AU - Hennequin, Romain
AU - Moussallam, Manuel
AU - Richard, Gaël
AU - D’Alché-Buc, Florence
N1 - Publisher Copyright:
© A. Vaglio, R. Hennequin, M. Moussallam, G. Richard, and F. d’Alché-Buc.
PY - 2020/1/1
Y1 - 2020/1/1
N2 - Lyrics-to-audio alignment methods have recently reported impressive results, opening the door to practical applications such as karaoke and within-song navigation. However, most studies focus on a single language - usually English - for which annotated data are abundant. The question of their ability to generalize to other languages, especially in low- (or even zero-) training-resource scenarios, has so far been left unexplored. In this paper, we address the lyrics-to-audio alignment task in a generalized multilingual setup. More precisely, this investigation presents the first (to the best of our knowledge) attempt to create a language-independent lyrics-to-audio alignment system. Building on a Recurrent Neural Network (RNN) model trained with a Connectionist Temporal Classification (CTC) algorithm, we study the relevance of different intermediate representations, either character or phoneme, along with several strategies to design a training set. The evaluation is conducted on multiple languages with a varying amount of data available, from plenty to zero. Results show that learning from diverse data and using a universal phoneme set as an intermediate representation yield the best generalization performance.
AB - Lyrics-to-audio alignment methods have recently reported impressive results, opening the door to practical applications such as karaoke and within-song navigation. However, most studies focus on a single language - usually English - for which annotated data are abundant. The question of their ability to generalize to other languages, especially in low- (or even zero-) training-resource scenarios, has so far been left unexplored. In this paper, we address the lyrics-to-audio alignment task in a generalized multilingual setup. More precisely, this investigation presents the first (to the best of our knowledge) attempt to create a language-independent lyrics-to-audio alignment system. Building on a Recurrent Neural Network (RNN) model trained with a Connectionist Temporal Classification (CTC) algorithm, we study the relevance of different intermediate representations, either character or phoneme, along with several strategies to design a training set. The evaluation is conducted on multiple languages with a varying amount of data available, from plenty to zero. Results show that learning from diverse data and using a universal phoneme set as an intermediate representation yield the best generalization performance.
M3 - Conference contribution
AN - SCOPUS:85183895436
T3 - Proceedings of the 21st International Society for Music Information Retrieval Conference, ISMIR 2020
SP - 512
EP - 519
BT - Proceedings of the 21st International Society for Music Information Retrieval Conference, ISMIR 2020
A2 - Cumming, Julie
A2 - Lee, Jin Ha
A2 - McFee, Brian
A2 - Schedl, Markus
A2 - Devaney, Johanna
A2 - McKay, Cory
A2 - Zangerle, Eva
A2 - de Reuse, Timothy
PB - International Society for Music Information Retrieval
T2 - 21st International Society for Music Information Retrieval Conference, ISMIR 2020
Y2 - 11 October 2020 through 16 October 2020
ER -