TY - GEN
T1 - Improving singing voice separation using Deep U-Net and Wave-U-Net with data augmentation
AU - Cohen-Hadria, Alice
AU - Roebel, Axel
AU - Peeters, Geoffroy
N1 - Publisher Copyright:
© 2019 IEEE
PY - 2019/9/1
Y1 - 2019/9/1
N2 - State-of-the-art singing voice separation is based on deep learning, making use of CNN structures with skip connections (such as the U-Net, Wave-U-Net, or MSDENSELSTM models). A key to the success of these models is the availability of a large amount of training data. In this study, we are interested in singing voice separation for mono signals and compare the U-Net and the Wave-U-Net, which are structurally similar but work on different input representations. First, we report a few results on variations of the U-Net model. Second, we discuss the potential of state-of-the-art speech and music transformation algorithms for augmenting existing data sets and demonstrate that the effect of these augmentations depends on the signal representation used by the model. The results demonstrate a considerable improvement due to the augmentation for both models. However, pitch transposition is the most effective augmentation strategy for the U-Net model, while transposition, time stretching, and formant shifting have a much more balanced effect on the Wave-U-Net model. Finally, we compare the two models on the same dataset.
AB - State-of-the-art singing voice separation is based on deep learning, making use of CNN structures with skip connections (such as the U-Net, Wave-U-Net, or MSDENSELSTM models). A key to the success of these models is the availability of a large amount of training data. In this study, we are interested in singing voice separation for mono signals and compare the U-Net and the Wave-U-Net, which are structurally similar but work on different input representations. First, we report a few results on variations of the U-Net model. Second, we discuss the potential of state-of-the-art speech and music transformation algorithms for augmenting existing data sets and demonstrate that the effect of these augmentations depends on the signal representation used by the model. The results demonstrate a considerable improvement due to the augmentation for both models. However, pitch transposition is the most effective augmentation strategy for the U-Net model, while transposition, time stretching, and formant shifting have a much more balanced effect on the Wave-U-Net model. Finally, we compare the two models on the same dataset.
KW - Convolutional neural network
KW - Data augmentation
KW - Singing voice separation
U2 - 10.23919/EUSIPCO.2019.8902810
DO - 10.23919/EUSIPCO.2019.8902810
M3 - Conference contribution
AN - SCOPUS:85075592996
T3 - European Signal Processing Conference
BT - EUSIPCO 2019 - 27th European Signal Processing Conference
PB - European Signal Processing Conference, EUSIPCO
T2 - 27th European Signal Processing Conference, EUSIPCO 2019
Y2 - 2 September 2019 through 6 September 2019
ER -