TY - GEN
T1 - Automatic Audio Description
T2 - 21st International Conference on Computer Analysis of Images and Patterns, CAIP 2025
AU - Tapu, Ruxandra
AU - Mocanu, Bogdan
N1 - Publisher Copyright:
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2026.
PY - 2026/1/1
Y1 - 2026/1/1
N2 - In this paper, we propose a training-free framework for generating audio descriptions (ADs) by leveraging large pretrained Video-Language Models (VLMs) and Large Language Models (LLMs) without task-specific fine-tuning. Our method enhances video understanding through a semantic-constrained prompting strategy that incorporates temporally coherent context into VLM inputs, while an adaptive character recognition module ensures consistent identity tracking across frames. By explicitly linking visual character observations to narrative elements, the system produces contextually rich and coherent visual descriptions. Finally, the video captions are refined into a single, concise audio description sentence by an LLM operating exclusively on text inputs, ensuring clarity, brevity, and narrative cohesion. The experimental evaluation, performed on the MAD-eval-Named and TV-AD benchmarks, validates the approach, achieving CIDEr scores of 23.2 and 23.4, respectively. Compared to state-of-the-art training-free baselines, our framework consistently yields relative improvements ranging from 3.6% to 8% across multiple evaluation metrics.
AB - In this paper, we propose a training-free framework for generating audio descriptions (ADs) by leveraging large pretrained Video-Language Models (VLMs) and Large Language Models (LLMs) without task-specific fine-tuning. Our method enhances video understanding through a semantic-constrained prompting strategy that incorporates temporally coherent context into VLM inputs, while an adaptive character recognition module ensures consistent identity tracking across frames. By explicitly linking visual character observations to narrative elements, the system produces contextually rich and coherent visual descriptions. Finally, the video captions are refined into a single, concise audio description sentence by an LLM operating exclusively on text inputs, ensuring clarity, brevity, and narrative cohesion. The experimental evaluation, performed on the MAD-eval-Named and TV-AD benchmarks, validates the approach, achieving CIDEr scores of 23.2 and 23.4, respectively. Compared to state-of-the-art training-free baselines, our framework consistently yields relative improvements ranging from 3.6% to 8% across multiple evaluation metrics.
KW - Audio description generation
KW - semantic-constrained prompting
KW - training-free methods
KW - video-language models
UR - https://www.scopus.com/pages/publications/105024547668
U2 - 10.1007/978-3-032-05060-1_15
DO - 10.1007/978-3-032-05060-1_15
M3 - Conference contribution
AN - SCOPUS:105024547668
SN - 9783032050595
T3 - Lecture Notes in Computer Science
SP - 173
EP - 183
BT - Computer Analysis of Images and Patterns - 21st International Conference, CAIP 2025, Proceedings
A2 - Castrillón-Santana, Modesto
A2 - Travieso-González, Carlos M.
A2 - Freire-Obregón, David
A2 - Hernández-Sosa, Daniel
A2 - Lorenzo-Navarro, Javier
A2 - Santana, Oliverio J.
A2 - Deniz Suarez, Oscar
PB - Springer Science and Business Media Deutschland GmbH
Y2 - 22 September 2025 through 25 September 2025
ER -