Active Speaker Recognition using Cross Attention Audio-Video Fusion

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-review

Abstract

Audio-video-based multimodal active speaker recognition from video streams has attracted the attention of the scientific community due to its wide range of applications, such as human-centered computing and semantic video understanding. Most existing techniques use early or late audio-video (A-V) fusion strategies without fully considering the inter-modal and intra-modal interactions. In this context, this research work proposes a novel cross-modal attention mechanism based on the visual and audio modalities, designed to capture the complex spatiotemporal relationships between descriptors and to fuse complementary information from multiple modalities. First, we perform representation learning on audio and video using deep convolutional neural networks (CNNs). Second, we feed the features of both modalities to a cross-attention block that fuses the A-V features at the model level. Finally, we obtain the identity of the active speaker and associate each character with the corresponding subtitle segment. The experimental evaluation, performed on 30 videos, validates the approach with average F1-scores above 88%. Compared against state-of-the-art methods, the proposed system architecture achieves accuracy gains of more than 3%.
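The fusion step described in the abstract can be illustrated with a minimal sketch of scaled dot-product cross-attention, in which video features act as queries over audio features. This is not the authors' implementation: the function names (`cross_attention`, `fuse`), the toy feature vectors, the absence of learned projection matrices, and the choice to concatenate each video feature with its attended audio context are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cross_attention(queries, keys, values):
    """Each query attends over all keys and returns a weighted sum of values.

    Assumption: no learned Q/K/V projections; features are used directly.
    """
    scale = math.sqrt(len(keys[0]))
    attended = []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in keys])
        context = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        attended.append(context)
    return attended


def fuse(video_feats, audio_feats):
    """Hypothetical fusion: concatenate each video feature with the audio
    context it attends to (video as queries, audio as keys and values)."""
    audio_context = cross_attention(video_feats, audio_feats, audio_feats)
    return [v + c for v, c in zip(video_feats, audio_context)]


# Toy stand-ins for CNN embeddings: 2 video frames, 3 audio windows, dim 2.
video = [[1.0, 0.0], [0.0, 1.0]]
audio = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused = fuse(video, audio)
```

Each fused vector has twice the input dimension. A symmetric variant would also let audio features attend over video features before a classifier predicts the active speaker; the abstract does not specify which arrangement the paper uses.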

Original language: English
Title of host publication: 2022 10th European Workshop on Visual Information Processing, EUVIP 2022 - Proceedings
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9781665466233
Publication status: Published - 1 Jan 2022
Event: 10th European Workshop on Visual Information Processing, EUVIP 2022 - Lisbon, Portugal
Duration: 11 Sept 2022 → 14 Sept 2022

Publication series

Name: Proceedings - European Workshop on Visual Information Processing, EUVIP
Volume: 2022-September
ISSN (Print): 2471-8963

Conference

Conference: 10th European Workshop on Visual Information Processing, EUVIP 2022
Country/Territory: Portugal
City: Lisbon
Period: 11/09/22 → 14/09/22

Keywords

  • active speaker recognition
  • audio-video fusion
  • cross-attention
  • dynamic subtitle
