A Lightweight Audio-Visual Speaker Detection System for Assistive Video Captioning

Research output: Chapter in book, report, anthology, or collection › Conference contribution › Peer-reviewed

Abstract

In this paper, we introduce a novel dynamic subtitling system aimed at improving media accessibility for individuals with hearing impairments. The system is built upon an end-to-end active speaker detection (ASD) framework that leverages joint audio and visual cues through multimodal feature integration. The architecture comprises two lightweight convolutional neural networks, each dedicated to a separate modality, with a cross-modal attention mechanism enhancing the interaction between them. This enables precise localization of the active speaker, allowing subtitle segments to be dynamically placed near the speaker's face to enhance readability and contextual alignment. The proposed approach is evaluated both quantitatively and qualitatively. On a benchmark dataset of 30 video samples, it achieves a gain of over 0.7% in mean Average Precision compared to state-of-the-art ASD techniques. To evaluate the perceptual effectiveness of the system, a structured user study was conducted comparing three subtitle presentation strategies: traditional fixed-position captions, speaker-aligned speech-bubble overlays, and the proposed dynamic placement method based on region-of-interest selection. Participants rated each version based on viewing comfort and visual effort, with the results showing a clear preference for the proposed system in terms of usability and reduced eye strain.

Original language: English
Title of host publication: 2025 13th European Workshop on Visual Information Processing, EUVIP 2025
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9798331575151
DOIs
State: Published - 1 Jan 2025
Event: 13th European Workshop on Visual Information Processing, EUVIP 2025 - Valletta, Malta
Duration: 13 Oct 2025 to 16 Oct 2025

Publication series

Name: Proceedings - European Workshop on Visual Information Processing, EUVIP
ISSN (Print): 2471-8963

Conference

Conference: 13th European Workshop on Visual Information Processing, EUVIP 2025
Country/Territory: Malta
City: Valletta
Period: 13/10/25 to 16/10/25
