Cross-modal gated feature enhancement for multimodal emotion recognition in conversations.
Zhao, Shiyun; Ren, Jinchang; Zhou, Xiaojuan
Abstract
Emotion recognition in conversations (ERC), which involves identifying the emotional state of each utterance within a dialogue, plays a vital role in developing empathetic artificial intelligence systems. In practical applications such as video-based recruitment interviews, customer service, health monitoring, intelligent personal assistants, and online education, ERC can facilitate the analysis of emotional cues, improve decision-making processes, and enhance user interaction and satisfaction. Current multimodal emotion recognition research faces several challenges, including ineffective extraction of emotional information from single modalities, underused complementary features, and inter-modal redundancy. To tackle these issues, this paper introduces a cross-modal gated attention mechanism for emotion recognition. The method extracts and fuses visual, textual, and auditory features to enhance accuracy and stability. A cross-modal guided gating mechanism is designed to strengthen single-modality features and to use a third modality to improve bimodal feature fusion, boosting the multimodal feature representation. Furthermore, a cross-modal distillation loss function is proposed to reduce redundancy and improve feature discrimination. This function employs a dual-supervision mechanism with teacher and student models, ensuring consistency across single-modal, bimodal, and trimodal feature representations. Experimental results on the IEMOCAP and MELD datasets indicate that the proposed method achieves higher accuracy than existing approaches and comparable F1 scores, highlighting its effectiveness in capturing multimodal dependencies and balancing modality contributions.
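The abstract only summarizes the method; the paper's exact formulation is not reproduced on this record page. As a rough illustration of the cross-modal guided gating idea, the PyTorch sketch below gates one modality's features on a second modality, lets a third modality guide bimodal fusion, and pairs student and teacher predictions through a standard softened-KL consistency term. All module names, shapes, and the specific loss form are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalGate(nn.Module):
    """Hypothetical gate: reweights a target modality's features using a guiding modality."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, target: torch.Tensor, guide: torch.Tensor) -> torch.Tensor:
        # Sigmoid gate conditioned on both streams; scales the target features.
        gate = torch.sigmoid(self.proj(torch.cat([target, guide], dim=-1)))
        return gate * target


class GuidedBimodalFusion(nn.Module):
    """Hypothetical bimodal fusion in which a third modality guides the pair."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate_a = CrossModalGate(dim)
        self.gate_b = CrossModalGate(dim)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, a: torch.Tensor, b: torch.Tensor, guide: torch.Tensor) -> torch.Tensor:
        # The guiding modality enhances each stream before the pair is fused.
        return self.fuse(torch.cat([self.gate_a(a, guide), self.gate_b(b, guide)], dim=-1))


def consistency_loss(student_logits: torch.Tensor,
                     teacher_logits: torch.Tensor,
                     temperature: float = 2.0) -> torch.Tensor:
    # Standard distillation form: KL divergence between temperature-softened distributions.
    t = temperature
    return F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits / t, dim=-1),
        reduction="batchmean",
    ) * (t * t)


if __name__ == "__main__":
    dim, batch = 128, 4
    text, audio, video = (torch.randn(batch, dim) for _ in range(3))
    fusion = GuidedBimodalFusion(dim)
    fused_ta = fusion(text, audio, video)  # video guides text-audio fusion
    print(fused_ta.shape)  # torch.Size([4, 128])
```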
Citation
ZHAO, S., REN, J. and ZHOU, X. 2025. Cross-modal gated feature enhancement for multimodal emotion recognition in conversations. Scientific reports [online], 15, article number 30004. Available from: https://doi.org/10.1038/s41598-025-11989-6
| Field | Value |
| --- | --- |
| Journal Article Type | Article |
| Acceptance Date | Jul 14, 2025 |
| Online Publication Date | Aug 16, 2025 |
| Publication Date | Dec 31, 2025 |
| Deposit Date | Aug 29, 2025 |
| Publicly Available Date | Aug 29, 2025 |
| Journal | Scientific Reports |
| Electronic ISSN | 2045-2322 |
| Publisher | Springer |
| Peer Reviewed | Peer Reviewed |
| Volume | 15 |
| Article Number | 30004 |
| DOI | https://doi.org/10.1038/s41598-025-11989-6 |
| Keywords | Cross-modal; Emotion recognition in conversations; Transformer; Deep learning; Gated attention |
| Public URL | https://rgu-repository.worktribe.com/output/2988398 |
Files
ZHAO 2025 Cross-modal gated (VOR) (PDF, 3.1 MB)
Publisher Licence URL
https://creativecommons.org/licenses/by-nc-nd/4.0/
Copyright Statement
© The Author(s) 2025.