
Automatic Audio Description: A Training-Free Approach Using Foundation Models

  • University 'Politehnica' of Bucharest
  • Telecom Sudparis

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

In this paper, we propose a training-free framework for generating audio descriptions (ADs) by leveraging large pretrained Video-Language Models (VLMs) and Large Language Models (LLMs) without task-specific fine-tuning. Our method enhances video understanding through a semantic-constrained prompting strategy that incorporates temporally coherent context into VLM inputs, while an adaptive character recognition module ensures consistent identity tracking across frames. By explicitly linking visual character observations to narrative elements, the system produces contextually rich and coherent visual descriptions. Finally, an LLM operating exclusively on text inputs refines the video captions into a single, concise audio description sentence, ensuring clarity, brevity, and narrative cohesion. The experimental evaluation on the MAD-eval-Named and TV-AD benchmarks validates the approach, which achieves CIDEr scores of 23.2 and 23.4, respectively. Compared to state-of-the-art training-free baselines, our framework consistently yields relative improvements ranging from 3.6% to 8% across multiple evaluation metrics.
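
The two-stage flow described in the abstract (a context-aware VLM caption, then a text-only LLM refinement) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not give the prompts, the models, or the form of the temporal context, so the names `Clip`, `build_vlm_prompt`, `describe_clips`, the prompt wording, and the choice of feeding the previous ADs back in as context are all hypothetical. The `vlm` and `llm` arguments are plain callables that the caller supplies.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Clip:
    """One video segment (hypothetical container, not from the paper)."""
    frames: Sequence[object]  # placeholder for decoded frames
    characters: List[str]     # names from an assumed character recognition step


def build_vlm_prompt(context: Sequence[str], characters: Sequence[str]) -> str:
    """Assemble a prompt that carries earlier descriptions and character names.

    The wording is an assumption; the abstract only says that temporally
    coherent context and character identities are given to the VLM.
    """
    parts = []
    if context:
        parts.append("Previous descriptions: " + " ".join(context))
    if characters:
        parts.append("Characters on screen: " + ", ".join(characters))
    parts.append("Describe what happens in this clip.")
    return "\n".join(parts)


def describe_clips(
    clips: Sequence[Clip],
    vlm: Callable[[Sequence[object], str], str],
    llm: Callable[[str], str],
    context_size: int = 2,
) -> List[str]:
    """Caption each clip with the VLM, then refine the caption with the LLM."""
    ads: List[str] = []
    for clip in clips:
        # Use the most recent ADs as temporal context (assumed design choice).
        context = ads[-context_size:] if context_size > 0 else []
        caption = vlm(clip.frames, build_vlm_prompt(context, clip.characters))
        # The LLM stage sees text only, as stated in the abstract.
        ads.append(llm("Rewrite as one concise audio description sentence: " + caption))
    return ads
```

Passing the models as callables keeps the sketch independent of any particular VLM or LLM library, so either stage can be swapped without touching the loop.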

Original language: English
Title of host publication: Computer Analysis of Images and Patterns - 21st International Conference, CAIP 2025, Proceedings
Editors: Modesto Castrillón-Santana, Carlos M. Travieso-González, David Freire-Obregón, Daniel Hernández-Sosa, Javier Lorenzo-Navarro, Oliverio J. Santana, Oscar Deniz Suarez
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 173-183
Number of pages: 11
ISBN (Print): 9783032050595
DOIs
Publication status: Published - 1 Jan 2026
Event: 21st International Conference on Computer Analysis of Images and Patterns, CAIP 2025 - Las Palmas de Gran Canaria, Spain
Duration: 22 Sept 2025 to 25 Sept 2025

Publication series

Name: Lecture Notes in Computer Science
Volume: 15622 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 21st International Conference on Computer Analysis of Images and Patterns, CAIP 2025
Country/Territory: Spain
City: Las Palmas de Gran Canaria
Period: 22/09/25 to 25/09/25

Keywords

  • Audio description generation
  • semantic-constrained prompting
  • training-free methods
  • video-language models
