Cross-modal gated feature enhancement for multimodal emotion recognition in conversations.
Zhao, Shiyun; Ren, Jinchang; Zhou, Xiaojuan
Abstract
Emotion recognition in conversations (ERC), which involves identifying the emotional state of each utterance within a dialogue, plays a vital role in developing empathetic artificial intelligence systems. In practical applications such as video-based recruitment interviews, customer service, health monitoring, intelligent personal assistants, and online education, ERC can facilitate the analysis of emotional cues, improve decision-making processes, and enhance user interaction and satisfaction. Current multimodal emotion recognition research faces several challenges, including ineffective extraction of emotional information from single modalities, underused complementary features, and inter-modal redundancy. To tackle these issues, this paper introduces a cross-modal gated attention mechanism for emotion recognition. The method extracts and fuses visual, textual, and auditory features to enhance accuracy and stability. A cross-modal guided gating mechanism is designed to strengthen single-modality features and to use a third modality to improve bimodal feature fusion, boosting the multimodal feature representation. Furthermore, a cross-modal distillation loss function is proposed to reduce redundancy and improve feature discrimination. This function employs a dual-supervision mechanism with teacher and student models, ensuring consistency across single-modal, bimodal, and trimodal feature representations. Experimental results on the IEMOCAP and MELD datasets indicate that the proposed method achieves higher accuracy than existing approaches and comparable F1 scores, highlighting its effectiveness in capturing multimodal dependencies and balancing modality contributions.
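The abstract only summarizes the method; the paper's exact formulation is not reproduced on this record page. As a rough illustration of the cross-modal guided gating idea, the PyTorch sketch below gates one modality's features on a second modality, lets a third modality guide bimodal fusion, and pairs student and teacher predictions through a standard softened-KL consistency term. All module names, shapes, and the specific loss form are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalGate(nn.Module):
    """Hypothetical gate: reweights a target modality's features using a guiding modality."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, target: torch.Tensor, guide: torch.Tensor) -> torch.Tensor:
        # Sigmoid gate conditioned on both streams; scales the target features.
        gate = torch.sigmoid(self.proj(torch.cat([target, guide], dim=-1)))
        return gate * target


class GuidedBimodalFusion(nn.Module):
    """Hypothetical bimodal fusion in which a third modality guides the pair."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate_a = CrossModalGate(dim)
        self.gate_b = CrossModalGate(dim)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, a: torch.Tensor, b: torch.Tensor, guide: torch.Tensor) -> torch.Tensor:
        # The guiding modality enhances each stream before the pair is fused.
        return self.fuse(torch.cat([self.gate_a(a, guide), self.gate_b(b, guide)], dim=-1))


def consistency_loss(student_logits: torch.Tensor,
                     teacher_logits: torch.Tensor,
                     temperature: float = 2.0) -> torch.Tensor:
    # Standard distillation form: KL divergence between temperature-softened distributions.
    t = temperature
    return F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits / t, dim=-1),
        reduction="batchmean",
    ) * (t * t)


if __name__ == "__main__":
    dim, batch = 128, 4
    text, audio, video = (torch.randn(batch, dim) for _ in range(3))
    fusion = GuidedBimodalFusion(dim)
    fused_ta = fusion(text, audio, video)  # video guides text-audio fusion
    print(fused_ta.shape)  # torch.Size([4, 128])
```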
Citation
ZHAO, S., REN, J. and ZHOU, X. 2025. Cross-modal gated feature enhancement for multimodal emotion recognition in conversations. Scientific reports [online], 15, article number 30004. Available from: https://doi.org/10.1038/s41598-025-11989-6
| Field | Value |
| --- | --- |
| Journal Article Type | Article |
| Acceptance Date | Jul 14, 2025 |
| Online Publication Date | Aug 16, 2025 |
| Publication Date | Dec 31, 2025 |
| Deposit Date | Aug 29, 2025 |
| Publicly Available Date | Aug 29, 2025 |
| Journal | Scientific Reports |
| Electronic ISSN | 2045-2322 |
| Publisher | Springer |
| Peer Reviewed | Peer Reviewed |
| Volume | 15 |
| Article Number | 30004 |
| DOI | https://doi.org/10.1038/s41598-025-11989-6 |
| Keywords | Cross-modal; Emotion recognition in conversations; Transformer; Deep learning; Gated attention |
| Public URL | https://rgu-repository.worktribe.com/output/2988398 |
Files
ZHAO 2025 Cross-modal gated (VOR) (PDF, 3.1 MB)
Publisher Licence URL
https://creativecommons.org/licenses/by-nc-nd/4.0/
Copyright Statement
© The Author(s) 2025.