TY - GEN
T1 - A Lightweight Audio-Visual Speaker Detection System for Assistive Video Captioning
AU - Mocanu, Bogdan
AU - Tapu, Ruxandra
N1 - Publisher Copyright:
© 2025 IEEE.
PY - 2025/1/1
Y1 - 2025/1/1
N2 - In this paper we introduce a novel dynamic subtitling system aimed at improving media accessibility for individuals with hearing impairments. The system is built upon an end-to-end active speaker detection framework that leverages joint audio and visual cues through multimodal feature integration. The architecture comprises two lightweight convolutional neural networks, each dedicated to a separate modality, with a cross-modal attention mechanism enhancing the interaction between them. This enables precise localization of the active speaker, allowing subtitle segments to be dynamically placed near the speaker's face to enhance readability and contextual alignment. The proposed approach is evaluated both quantitatively and qualitatively. On a benchmark dataset of 30 video samples, it achieves a gain of over 0.7% in mean Average Precision compared to state-of-the-art ASD techniques. To evaluate the perceptual effectiveness of the system, a structured user study was conducted comparing three subtitle presentation strategies: traditional fixed-position captions, speaker-aligned speech-bubble overlays, and the proposed dynamic placement method based on region-of-interest selection. Participants rated each version based on viewing comfort and visual effort, with the results showing a clear preference for the proposed system in terms of usability and reduced eye strain.
AB - In this paper we introduce a novel dynamic subtitling system aimed at improving media accessibility for individuals with hearing impairments. The system is built upon an end-to-end active speaker detection framework that leverages joint audio and visual cues through multimodal feature integration. The architecture comprises two lightweight convolutional neural networks, each dedicated to a separate modality, with a cross-modal attention mechanism enhancing the interaction between them. This enables precise localization of the active speaker, allowing subtitle segments to be dynamically placed near the speaker's face to enhance readability and contextual alignment. The proposed approach is evaluated both quantitatively and qualitatively. On a benchmark dataset of 30 video samples, it achieves a gain of over 0.7% in mean Average Precision compared to state-of-the-art ASD techniques. To evaluate the perceptual effectiveness of the system, a structured user study was conducted comparing three subtitle presentation strategies: traditional fixed-position captions, speaker-aligned speech-bubble overlays, and the proposed dynamic placement method based on region-of-interest selection. Participants rated each version based on viewing comfort and visual effort, with the results showing a clear preference for the proposed system in terms of usability and reduced eye strain.
KW - active speaker detection
KW - audio-visual fusion
KW - cross-modal attention
KW - dynamic subtitles
KW - multimodal learning
UR - https://www.scopus.com/pages/publications/105029742423
U2 - 10.1109/EUVIP66349.2025.11238860
DO - 10.1109/EUVIP66349.2025.11238860
M3 - Conference contribution
AN - SCOPUS:105029742423
T3 - Proceedings - European Workshop on Visual Information Processing, EUVIP
BT - 2025 13th European Workshop on Visual Information Processing, EUVIP 2025
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 13th European Workshop on Visual Information Processing, EUVIP 2025
Y2 - 13 October 2025 through 16 October 2025
ER -