TY - GEN
T1 - Content Based Singing Voice Source Separation via Strong Conditioning Using Aligned Phonemes
AU - Meseguer-Brocal, Gabriel
AU - Peeters, Geoffroy
N1 - Publisher Copyright:
© Gabriel Meseguer-Brocal, Geoffroy Peeters.
PY - 2020/1/1
Y1 - 2020/1/1
N2 - Informed source separation has recently gained renewed interest with the introduction of neural networks and the availability of large multitrack datasets containing both the mixture and the separated sources. These approaches use prior information about the target source to improve separation. Historically, Music Information Retrieval researchers have focused primarily on score-informed source separation, but more recent approaches explore lyrics-informed source separation. However, because of the lack of multitrack datasets with time-aligned lyrics, models use weak conditioning with non-aligned lyrics. In this paper, we present a multimodal multitrack dataset with lyrics aligned in time at the word level with phonetic information as well as explore strong conditioning using the aligned phonemes. Our model follows a U-Net architecture and takes as input both the magnitude spectrogram of a musical mixture and a matrix with aligned phonetic information. The phoneme matrix is embedded to obtain the parameters that control Feature-wise Linear Modulation (FiLM) layers. These layers condition the U-Net feature maps to adapt the separation process to the presence of different phonemes via affine transformations. We show that phoneme conditioning can be successfully applied to improve singing voice source separation.
UR - https://www.scopus.com/pages/publications/85164757798
M3 - Conference contribution
AN - SCOPUS:85164757798
T3 - Proceedings of the 21st International Society for Music Information Retrieval Conference, ISMIR 2020
SP - 109
EP - 116
BT - Proceedings of the 21st International Society for Music Information Retrieval Conference, ISMIR 2020
A2 - Cumming, Julie
A2 - Lee, Jin Ha
A2 - McFee, Brian
A2 - Schedl, Markus
A2 - Devaney, Johanna
A2 - McKay, Cory
A2 - Zangerle, Eva
A2 - de Reuse, Timothy
PB - International Society for Music Information Retrieval
T2 - 21st International Society for Music Information Retrieval Conference, ISMIR 2020
Y2 - 11 October 2020 through 16 October 2020
ER -