Emotion Recognition from Raw Speech Signals Using 2D CNN with Deep Metric Learning

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

This paper introduces a novel framework for emotion recognition from raw speech signals. The system is based on a ResNet architecture fed with spectrogram inputs. The CNN is extended with a GhostVLAD feature aggregation layer that produces a single, fixed-size descriptor at the utterance level. The system adopts a sentiment metric loss that accounts for the relations between emotion classes. An experimental evaluation on two publicly available databases, RAVDESS and CREMA-D, validates the proposed methodology with average accuracy scores of 82% and 63%, respectively.
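To illustrate how a GhostVLAD-style layer can turn a variable number of frame-level CNN features into one fixed-size utterance descriptor, here is a minimal NumPy sketch of the forward pass. It is not the authors' implementation: the function name `ghostvlad_aggregate`, the feature dimension, the cluster counts, and the random parameters are all assumptions chosen for illustration, and in a real system the centers, weights, and biases would be learned jointly with the network.

```python
import numpy as np

def ghostvlad_aggregate(features, centers, weights, biases, num_ghost):
    """Aggregate (N, D) local features into one (K * D,) descriptor.

    centers: (K + G, D) cluster centers, the last G being "ghost" clusters
    weights: (D, K + G) and biases: (K + G,) define the soft assignment
    All shapes and names are illustrative assumptions.
    """
    # Soft-assign every local feature to all K + G clusters (stable softmax).
    logits = features @ weights + biases
    logits -= logits.max(axis=1, keepdims=True)
    assign = np.exp(logits)
    assign /= assign.sum(axis=1, keepdims=True)

    # Assignment-weighted sum of residuals to each cluster center.
    residuals = features[:, None, :] - centers[None, :, :]   # (N, K+G, D)
    vlad = (assign[:, :, None] * residuals).sum(axis=0)      # (K+G, D)

    # Ghost clusters absorb uninformative frames and are then discarded.
    num_real = centers.shape[0] - num_ghost
    vlad = vlad[:num_real]

    # Intra-normalize each cluster, flatten, then L2-normalize the result.
    vlad /= np.linalg.norm(vlad, axis=1, keepdims=True) + 1e-12
    flat = vlad.reshape(-1)
    return flat / (np.linalg.norm(flat) + 1e-12)

# Hypothetical sizes: 50 frames, 16-dim features, 4 real + 2 ghost clusters.
rng = np.random.default_rng(0)
frames = rng.normal(size=(50, 16))
centers = rng.normal(size=(6, 16))
weights = rng.normal(size=(16, 6))
biases = np.zeros(6)

descriptor = ghostvlad_aggregate(frames, centers, weights, biases, num_ghost=2)
print(descriptor.shape)
```

The descriptor length depends only on the number of real clusters and the feature dimension, so utterances of different durations map to vectors of the same size.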

Original language: English
Title of host publication: 2022 IEEE International Conference on Consumer Electronics, ICCE 2022
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9781665441544
Publication status: Published - 1 Jan 2022
Event: 2022 IEEE International Conference on Consumer Electronics, ICCE 2022 - Virtual, Online, United States
Duration: 7 Jan 2022 to 9 Jan 2022

Publication series

Name: Digest of Technical Papers - IEEE International Conference on Consumer Electronics
Volume: 2022-January
ISSN (Print): 0747-668X

Conference

Conference: 2022 IEEE International Conference on Consumer Electronics, ICCE 2022
Country/Territory: United States
City: Virtual, Online
Period: 7/01/22 to 9/01/22

Keywords

  • GhostVLAD aggregation layer
  • multi-stage training
  • sentiment metric learning
  • speech emotion recognition
