A Lightweight Audio-Visual Speaker Detection System for Assistive Video Captioning

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

In this paper, we introduce a novel dynamic subtitling system aimed at improving media accessibility for individuals with hearing impairments. The system is built upon an end-to-end active speaker detection (ASD) framework that combines audio and visual cues through multimodal feature integration. The architecture comprises two lightweight convolutional neural networks, one per modality, with a cross-modal attention mechanism enhancing the interaction between them. This enables precise localization of the active speaker, so that subtitle segments can be placed dynamically near the speaker's face to improve readability and contextual alignment. The proposed approach is evaluated both quantitatively and qualitatively. On a benchmark dataset of 30 video samples, it achieves a gain of over 0.7% in mean Average Precision compared to state-of-the-art ASD techniques. To evaluate the perceptual effectiveness of the system, a structured user study was conducted comparing three subtitle presentation strategies: traditional fixed-position captions, speaker-aligned speech-bubble overlays, and the proposed dynamic placement method based on region-of-interest selection. Participants rated each version on viewing comfort and visual effort, and the results show a clear preference for the proposed system in terms of usability and reduced eye strain.
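The abstract does not state the rule used to position a subtitle once the active speaker's face has been located. As a purely illustrative sketch, and not the authors' region-of-interest selection method, the idea of placing a caption near a detected face can be written as a small geometric heuristic. The `Box` type, the `place_subtitle` function, and the `margin` parameter are all hypothetical names introduced here, and the "below the face, otherwise above" preference is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle in pixel coordinates (origin at top-left)."""
    x: int
    y: int
    w: int
    h: int


def place_subtitle(face: Box, sub_w: int, sub_h: int,
                   frame_w: int, frame_h: int, margin: int = 8) -> Box:
    """Return a subtitle box near the face, kept inside the frame.

    Hypothetical heuristic for illustration only: prefer the area just
    below the face and fall back to the area just above it.
    """
    # Centre the subtitle horizontally on the face, then clamp to the frame.
    x = face.x + face.w // 2 - sub_w // 2
    x = max(0, min(x, frame_w - sub_w))

    # Preferred position: directly below the face.
    y = face.y + face.h + margin
    if y + sub_h > frame_h:
        # Not enough room below, so try directly above the face instead.
        y = face.y - margin - sub_h
    # Final vertical clamp in case neither position fits cleanly.
    y = max(0, min(y, frame_h - sub_h))

    return Box(x, y, sub_w, sub_h)
```

For a 640x360 frame, calling `place_subtitle(Box(100, 50, 80, 80), 200, 40, 640, 360)` puts the caption below the face, while a face close to the bottom edge makes the function fall back to the position above it. A real system would also need to avoid occluding other faces and to smooth the position over time, neither of which is attempted here.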

Original language: English
Title of host publication: 2025 13th European Workshop on Visual Information Processing, EUVIP 2025
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9798331575151
DOIs
Publication status: Published - 1 Jan 2025
Event: 13th European Workshop on Visual Information Processing, EUVIP 2025 - Valletta, Malta
Duration: 13 Oct 2025 to 16 Oct 2025

Publication series

Name: Proceedings - European Workshop on Visual Information Processing, EUVIP
ISSN (Print): 2471-8963

Conference

Conference: 13th European Workshop on Visual Information Processing, EUVIP 2025
Country/Territory: Malta
City: Valletta
Period: 13/10/25 to 16/10/25

Keywords

  • active speaker detection
  • audio-visual fusion
  • cross-modal attention
  • dynamic subtitles
  • multimodal learning