Audio-Video Fusion with Double Attention for Multimodal Emotion Recognition

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-review

Abstract

Recently, multimodal emotion recognition has become a hot research topic within the affective computing community due to its robust performance. In this paper, we propose to analyze emotions in an end-to-end manner based on various convolutional neural network (CNN) architectures and attention mechanisms. Specifically, we develop a new framework that integrates spatial and temporal attention into a visual 3D-CNN and temporal attention into an audio 2D-CNN in order to capture intra-modal feature characteristics. Further, the system is extended with an audio-video cross-attention fusion approach that effectively exploits the relationship between the two modalities. The proposed method achieves 87.89% accuracy on the RAVDESS dataset. Compared with state-of-the-art methods, our system demonstrates an accuracy gain of more than 1.89%.
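The abstract does not reproduce the paper's equations, but the audio-video cross-attention it describes is conventionally built from scaled dot-product attention applied across modalities. The following is a minimal NumPy sketch under that assumption; the feature dimensions, the omission of learned query/key/value projections, and the concatenation-style fusion are illustrative choices, not the authors' exact design.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values):
    """Scaled dot-product cross-attention: `queries` from one modality
    attend over `keys_values` from the other modality.
    (Learned projection matrices are omitted for brevity.)"""
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)   # (Tq, Tkv)
    weights = softmax(scores, axis=-1)              # each row sums to 1
    return weights @ keys_values                    # (Tq, d)

# Hypothetical feature sequences: 10 audio frames, 16 video frames, dim 64
rng = np.random.default_rng(0)
audio = rng.standard_normal((10, 64))
video = rng.standard_normal((16, 64))

# Each modality attends to the other; the two attended streams could then
# be concatenated (or averaged) before the final emotion classifier.
audio_attending_video = cross_attention(audio, video)   # (10, 64)
video_attending_audio = cross_attention(video, audio)   # (16, 64)
```

In a full model, `audio` and `video` would be the intermediate feature maps of the audio 2D-CNN and visual 3D-CNN respectively, and the projections would be trained end-to-end with the rest of the network.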

Original language: English
Title of host publication: IVMSP 2022 - 2022 IEEE 14th Image, Video, and Multidimensional Signal Processing Workshop
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9781665478229
Publication status: Published - 1 Jan 2022
Event: 14th IEEE Image, Video, and Multidimensional Signal Processing Workshop, IVMSP 2022 - Nafplio, Greece
Duration: 26 Jun 2022 – 29 Jun 2022

Publication series

Name: IVMSP 2022 - 2022 IEEE 14th Image, Video, and Multidimensional Signal Processing Workshop

Conference

Conference: 14th IEEE Image, Video, and Multidimensional Signal Processing Workshop, IVMSP 2022
Country/Territory: Greece
City: Nafplio
Period: 26/06/22 – 29/06/22

Keywords

  • cross-fusion
  • emotion recognition
  • spatial attention
  • temporal attention

