TY - GEN
T1 - Emotion Recognition in Video Streams Using Intramodal and Intermodal Attention Mechanisms
AU - Mocanu, Bogdan
AU - Tapu, Ruxandra
N1 - Publisher Copyright:
© 2022, The Author(s), under exclusive license to Springer Nature Switzerland AG.
PY - 2022/1/1
Y1 - 2022/1/1
N2 - Automatic emotion recognition from video streams is an essential challenge for various applications, including human behavior understanding, mental disease diagnosis, surveillance, and human-machine interaction. In this paper, we introduce a novel, completely automatic, multimodal emotion recognition framework based on the fusion of audio and visual information, designed to leverage the mutually complementary nature of the features while maintaining the modality-distinctive information. Specifically, we integrate spatial, channel, and temporal attention into the visual processing pipeline and temporal self-attention into the audio branch. Then, a multimodal cross-attention fusion strategy is introduced that effectively exploits the relationship between the audio and video features. The experimental evaluation performed on RAVDESS, a publicly available database, validates the proposed approach, with average accuracy scores exceeding 87.85%. Compared with state-of-the-art methods, the proposed framework yields accuracy gains of more than 1.85%.
AB - Automatic emotion recognition from video streams is an essential challenge for various applications, including human behavior understanding, mental disease diagnosis, surveillance, and human-machine interaction. In this paper, we introduce a novel, completely automatic, multimodal emotion recognition framework based on the fusion of audio and visual information, designed to leverage the mutually complementary nature of the features while maintaining the modality-distinctive information. Specifically, we integrate spatial, channel, and temporal attention into the visual processing pipeline and temporal self-attention into the audio branch. Then, a multimodal cross-attention fusion strategy is introduced that effectively exploits the relationship between the audio and video features. The experimental evaluation performed on RAVDESS, a publicly available database, validates the proposed approach, with average accuracy scores exceeding 87.85%. Compared with state-of-the-art methods, the proposed framework yields accuracy gains of more than 1.85%.
KW - Audio and video fusion
KW - Cross-modal emotion recognition
KW - Self-attention
KW - Spatial/channel and temporal attention
U2 - 10.1007/978-3-031-20716-7_23
DO - 10.1007/978-3-031-20716-7_23
M3 - Conference contribution
AN - SCOPUS:85145261079
SN - 9783031207150
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 295
EP - 306
BT - Advances in Visual Computing - 17th International Symposium, ISVC 2022, Proceedings
A2 - Bebis, George
A2 - Li, Bo
A2 - Yao, Angela
A2 - Liu, Yang
A2 - Duan, Ye
A2 - Lau, Manfred
A2 - Khadka, Rajiv
A2 - Crisan, Ana
A2 - Chang, Remco
PB - Springer Science and Business Media Deutschland GmbH
T2 - 17th International Symposium on Visual Computing, ISVC 2022
Y2 - 3 October 2022 through 5 October 2022
ER -