TY - GEN
T1 - Multimodal active speaker detection using cross-Attention and contextual information
AU - Mocanu, Bogdan
AU - Tapu, Ruxandra
N1 - Publisher Copyright: © 2024 IEEE.
PY - 2024/1/1
Y1 - 2024/1/1
N2 - An active speaker detection (ASD) framework aims to identify whether an on-screen person is speaking in each frame of a video. In this paper, we introduce a novel ASD system that carefully integrates audio and video cues through a cross-attention module, capturing inter-modal information while retaining the distinct intra-modal features. Furthermore, the system models the relations between speakers within the same scene. The experimental evaluation validates the effectiveness of the approach, achieving an average mAP score of 94.8%.
AB - An active speaker detection (ASD) framework aims to identify whether an on-screen person is speaking in each frame of a video. In this paper, we introduce a novel ASD system that carefully integrates audio and video cues through a cross-attention module, capturing inter-modal information while retaining the distinct intra-modal features. Furthermore, the system models the relations between speakers within the same scene. The experimental evaluation validates the effectiveness of the approach, achieving an average mAP score of 94.8%.
KW - contextual speaker relations
KW - cross-Attention block
KW - multimodal active speaker detection
U2 - 10.1109/ICCE59016.2024.10444380
DO - 10.1109/ICCE59016.2024.10444380
M3 - Conference contribution
AN - SCOPUS:85186972891
T3 - Digest of Technical Papers - IEEE International Conference on Consumer Electronics
BT - 2024 IEEE International Conference on Consumer Electronics, ICCE 2024
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2024 IEEE International Conference on Consumer Electronics, ICCE 2024
Y2 - 6 January 2024 through 8 January 2024
ER -