Emotion Recognition in Video Streams Using Intramodal and Intermodal Attention Mechanisms

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Automatic emotion recognition from video streams is an essential challenge for various applications, including human behavior understanding, mental disease diagnosis, surveillance, and human-machine interaction. In this paper, we introduce a novel, completely automatic, multimodal emotion recognition framework based on the fusion of audio and visual information, designed to leverage the mutually complementary nature of the features while maintaining the modality-distinctive information. Specifically, we integrate spatial, channel, and temporal attention into the visual processing pipeline and temporal self-attention into the audio branch. Then, a multimodal cross-attention fusion strategy is introduced that effectively exploits the relationship between the audio and video features. The experimental evaluation performed on RAVDESS, a publicly available database, validates the proposed approach with average accuracy scores superior to 87.85%. When compared with state-of-the-art methods, the proposed framework returns accuracy gains of more than 1.85%.
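
The cross-attention fusion idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract gives no architectural details, so the example below assumes plain scaled dot-product attention without learned projections, a residual sum to keep each modality's own features, and mean pooling over time. The function names, feature sizes, and random inputs are all hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context):
    # queries: (Tq, d) features of one modality.
    # context: (Tc, d) features of the other modality.
    # Each query step forms a weighted average of the context steps.
    d = queries.shape[-1]
    scores = queries @ context.T / np.sqrt(d)   # (Tq, Tc)
    weights = softmax(scores, axis=-1)          # each row sums to 1
    return weights @ context, weights           # (Tq, d), (Tq, Tc)

def fuse(audio, video):
    # Assumption: a residual sum retains modality-specific information,
    # and mean pooling over time yields one vector per modality.
    audio_from_video, _ = cross_attention(audio, video)
    video_from_audio, _ = cross_attention(video, audio)
    audio_vec = (audio + audio_from_video).mean(axis=0)
    video_vec = (video + video_from_audio).mean(axis=0)
    return np.concatenate([audio_vec, video_vec])

# Hypothetical inputs: 20 audio steps and 16 video steps, 64 features each.
rng = np.random.default_rng(0)
audio = rng.standard_normal((20, 64))
video = rng.standard_normal((16, 64))

fused = fuse(audio, video)
_, attn = cross_attention(audio, video)
```

In a trained model the queries, keys, and values would normally pass through learned linear projections, and the fused vector would feed a classifier over the emotion classes; both are omitted here to keep the sketch short.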

Original language: English
Title of host publication: Advances in Visual Computing - 17th International Symposium, ISVC 2022, Proceedings
Editors: George Bebis, Bo Li, Angela Yao, Yang Liu, Ye Duan, Manfred Lau, Rajiv Khadka, Ana Crisan, Remco Chang
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 295-306
Number of pages: 12
ISBN (Print): 9783031207150
DOIs
Publication status: Published - 1 Jan 2022
Event: 17th International Symposium on Visual Computing, ISVC 2022 - San Diego, United States
Duration: 3 Oct 2022 to 5 Oct 2022

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 13599 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 17th International Symposium on Visual Computing, ISVC 2022
Country/Territory: United States
City: San Diego
Period: 3/10/22 to 5/10/22

Keywords

  • Audio and video fusion
  • Cross-modal emotion recognition
  • Self-attention
  • Spatial/channel and temporal attention
