Seeing Through Words: A Zero-Shot Multimodal Audio Description System with Foundation Models

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Audio description (AD) plays a crucial role in making video content accessible to visually impaired audiences, yet current approaches often rely on expensive supervised training or struggle to capture temporal and narrative consistency. We introduce a training-free framework that integrates vision–language models (VLMs) with large language models (LLMs) through three complementary mechanisms: semantic-constrained prompting to reduce irrelevant content, adaptive character reasoning for accurate entity grounding, and a memory structure that aligns fine-grained shot-level cues with longer scene-level context. This design allows the system to generate temporally coherent and context-aware AD without requiring additional training data. Evaluation on the MAD-eval-Named and TV-AD benchmarks demonstrates consistent improvements over state-of-the-art training-free methods, with gains in both lexical and semantic quality metrics.

Original language: English
Title of host publication: Advances in Visual Computing - 20th International Symposium, ISVC 2025, Proceedings
Editors: George Bebis, Jinwei Ye, Yuxiong Wang, Mina Konakovic Lukovic, Nima Khademi Kalantari, Isaac Cho, Yalong Yang, Evanthia Dimara, Matthew Brehmer
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 85-97
Number of pages: 13
ISBN (Print): 9783032144942
DOIs
Publication status: Published - 1 Jan 2026
Event: 20th International Symposium on Visual Computing, ISVC 2025 - Las Vegas, United States
Duration: 17 Nov 2025 to 19 Nov 2025

Publication series

Name: Lecture Notes in Computer Science
Volume: 16397 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 20th International Symposium on Visual Computing, ISVC 2025
Country/Territory: United States
City: Las Vegas
Period: 17/11/25 to 19/11/25

Keywords

  • Audio description
  • Character recognition
  • Semantic prompting
  • Temporal memory
  • Video understanding
