
Automatic Audio Description: A Training-Free Approach Using Foundation Models

  • University 'Politehnica' of Bucharest
  • Telecom Sudparis

Research output: Chapter in a book, report, anthology, or collection › Conference contribution › Peer-reviewed

Abstract

In this paper, we propose a training-free framework for generating audio descriptions (ADs) that leverages large pretrained Video-Language Models (VLMs) and Large Language Models (LLMs) without task-specific fine-tuning. Our method enhances video understanding through a semantic-constrained prompting strategy that incorporates temporally coherent context into VLM inputs, while an adaptive character recognition module ensures consistent identity tracking across frames. By explicitly linking visual character observations to narrative elements, the system produces contextually rich and coherent visual descriptions. Finally, an LLM operating exclusively on text inputs refines the video captions into a single, concise audio description sentence, ensuring clarity, brevity, and narrative cohesion. Experimental evaluation on the MAD-eval-Named and TV-AD benchmarks validates the approach, which achieves CIDEr scores of 23.2 and 23.4, respectively. Compared to state-of-the-art training-free baselines, our framework consistently yields relative improvements ranging from 3.6% to 8% across multiple evaluation metrics.
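
The two-stage flow described in the abstract (a context-aware VLM prompt, followed by text-only LLM refinement) might be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the prompt wording, the `max_context` window, and the representation of recognised characters as a name-to-appearance dictionary are all hypothetical, and the VLM and LLM are passed in as plain callables rather than tied to any particular model or library.

```python
from typing import Callable, Dict, List

# Hypothetical stand-ins for the foundation models (not from the paper):
VLM = Callable[[str, List[str]], str]  # (prompt, frame identifiers) -> caption
LLM = Callable[[str], str]             # text prompt -> text


def build_vlm_prompt(context: List[str], characters: Dict[str, str]) -> str:
    """Fold prior descriptions and recognised characters into the VLM prompt.

    The wording of the prompt is an assumption made for illustration.
    """
    lines = ["Describe what happens in the clip in plain visual terms."]
    if characters:
        cast = ", ".join(
            f"{name} ({look})" for name, look in sorted(characters.items())
        )
        lines.append(f"Characters visible: {cast}.")
    if context:
        lines.append("Previous descriptions: " + " ".join(context))
    return "\n".join(lines)


def describe_clip(
    frames: List[str],
    context: List[str],
    characters: Dict[str, str],
    vlm: VLM,
    llm: LLM,
    max_context: int = 3,
) -> str:
    """Caption a clip with the VLM, then refine the caption with a text-only LLM."""
    # Keep only the most recent descriptions as temporal context (assumed window).
    recent = context[-max_context:] if max_context > 0 else []
    caption = vlm(build_vlm_prompt(recent, characters), frames)
    # The LLM sees text only: the caption, never the frames.
    refine_prompt = (
        "Rewrite the caption below as one concise audio description sentence.\n"
        f"Caption: {caption}"
    )
    return llm(refine_prompt)
```

Passing the models as callables keeps the sketch independent of any specific VLM or LLM, in the spirit of a training-free setup where pretrained components are used as-is.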

Original language: English
Title of host publication: Computer Analysis of Images and Patterns - 21st International Conference, CAIP 2025, Proceedings
Editors: Modesto Castrillón-Santana, Carlos M. Travieso-González, David Freire-Obregón, Daniel Hernández-Sosa, Javier Lorenzo-Navarro, Oliverio J. Santana, Oscar Deniz Suarez
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 173-183
Number of pages: 11
ISBN (print): 9783032050595
Status: Published - 1 Jan 2026
Event: 21st International Conference on Computer Analysis of Images and Patterns, CAIP 2025 - Las Palmas de Gran Canaria, Spain
Duration: 22 Sept 2025 → 25 Sept 2025

Publication series

Name: Lecture Notes in Computer Science
Volume: 15622 LNCS
ISSN (print): 0302-9743
ISSN (electronic): 1611-3349

Conference

Conference: 21st International Conference on Computer Analysis of Images and Patterns, CAIP 2025
Country/Territory: Spain
City: Las Palmas de Gran Canaria
Period: 22/09/25 → 25/09/25
