Multimodal active speaker detection using cross-attention and contextual information

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

An active speaker detection (ASD) framework aims to identify whether an on-screen person is speaking in each frame of a video. In this paper, we introduce a novel ASD system that carefully integrates audio and video cues through a cross-attention module, capturing inter-modal information while retaining the distinct intra-modal features. Furthermore, the system models the relations between speakers within the same scene. Experimental evaluation validates the effectiveness of the approach, achieving an average mAP score of 94.8%.
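To make the two ideas in the abstract concrete, the following PyTorch sketch shows one plausible way to build a cross-attention block in which each modality queries the other while residual connections preserve intra-modal features, followed by self-attention over the speakers in a scene. All names, dimensions, and fusion choices (CrossModalAttention, InterSpeakerContext, dim=128, concatenation) are illustrative assumptions, not the published architecture.

    import torch
    import torch.nn as nn

    class CrossModalAttention(nn.Module):
        """Cross-attention block: each modality queries the other, while
        residual connections retain the distinct intra-modal features."""
        def __init__(self, dim=128, num_heads=4):
            super().__init__()
            self.audio_to_visual = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.visual_to_audio = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.norm_a = nn.LayerNorm(dim)
            self.norm_v = nn.LayerNorm(dim)

        def forward(self, audio, visual):
            # audio, visual: (speakers, frames, dim) frame-level embeddings
            a_att, _ = self.audio_to_visual(query=audio, key=visual, value=visual)
            v_att, _ = self.visual_to_audio(query=visual, key=audio, value=audio)
            audio = self.norm_a(audio + a_att)    # inter-modal info + intra-modal residual
            visual = self.norm_v(visual + v_att)
            return torch.cat([audio, visual], dim=-1)  # fused per-frame feature

    class InterSpeakerContext(nn.Module):
        """Self-attention across the speakers visible in the same scene."""
        def __init__(self, dim=256, num_heads=4):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.norm = nn.LayerNorm(dim)

        def forward(self, speakers):
            # speakers: (frames, num_speakers, dim) fused per-speaker features
            ctx, _ = self.attn(speakers, speakers, speakers)
            return self.norm(speakers + ctx)

    # Usage: 4 candidate speakers, 25 video frames, 128-d embeddings
    fuse = CrossModalAttention(dim=128)
    context = InterSpeakerContext(dim=256)
    fused = fuse(torch.randn(4, 25, 128), torch.randn(4, 25, 128))  # (4, 25, 256)
    scene = context(fused.transpose(0, 1))  # attend across speakers in each frame

A per-frame classification head on top of the contextualized features would then produce the speaking/not-speaking score for each visible face.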

Original language: English
Title of host publication: 2024 IEEE International Conference on Consumer Electronics, ICCE 2024
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9798350324136
DOIs
Publication status: Published - 1 Jan 2024
Event: 2024 IEEE International Conference on Consumer Electronics, ICCE 2024 - Las Vegas, United States
Duration: 6 Jan 2024 - 8 Jan 2024

Publication series

Name: Digest of Technical Papers - IEEE International Conference on Consumer Electronics
ISSN (Print): 0747-668X
ISSN (Electronic): 2159-1423

Conference

Conference: 2024 IEEE International Conference on Consumer Electronics, ICCE 2024
Country/Territory: United States
City: Las Vegas
Period: 6/01/24 - 8/01/24

Keywords

  • contextual speaker relations
  • cross-attention block
  • multimodal active speaker detection
