Please use this identifier to cite or link to this item: https://elib.bsu.by/handle/123456789/337595

| Title: | Multimodal Emotion Recognition Method Based on Domain Generalization and Graph Neural Networks |
| Authors: | Xie, J.; Wang, Y.; Meng, T.; Tai, J.; Zheng, Y.; Varatnitski, Yu. I. |
| Keywords: | ЭБ БГУ (BSU Digital Library)::TECHNICAL AND APPLIED SCIENCES. SECTORS OF THE ECONOMY::Automation. Computer engineering |
| Issue Date: | 2025 |
| Publisher: | MDPI |
| Citation: | Electronics. 2025; 14: 885 |
| Abstract: | In recent years, multimodal sentiment analysis has attracted increasing attention from researchers owing to the rapid development of human–computer interaction. Sentiment analysis is an important task for understanding dialogues. However, with the growth of multimodal data, the processing of individual modality features and the methods used for multimodal feature fusion have become increasingly important research questions. Existing methods that handle the features of each modality separately are not well suited for subsequent multimodal fusion and often fail to capture sufficient global and local information. Therefore, this study proposes a novel multimodal sentiment analysis method based on domain generalization and graph neural networks. The main characteristic of this method is that it treats the features of each modality as domains and extracts both domain-specific and cross-domain-invariant features, thereby facilitating cross-domain generalization. Such generalized features are better suited for multimodal fusion. Graph neural networks are employed to extract global and local information from the dialogue in order to capture the emotional changes of the speakers. Specifically, global representations are captured by modeling cross-modal interactions at the dialogue level, whereas local information is typically inferred from temporal information or the emotional changes of the speakers. The method proposed in this study outperformed existing models on the IEMOCAP, CMU-MOSEI, and MELD datasets by 0.97%, 1.09% (for seven-class classification), and 0.65% in terms of weighted F1 score, respectively. This demonstrates that the domain-generalized features proposed in this study are better suited for subsequent multimodal fusion, and that the model developed here is more effective at capturing both global and local information. |
| URI: | https://elib.bsu.by/handle/123456789/337595 |
| DOI: | 10.3390/electronics14050885 |
| Scopus: | 86000555087 |
| Sponsorship: | This research was funded by the Education Department of Hainan Province (project number: Huky2022-19) and by the Key Project of Application Research on the National Smart Education Platform for Primary and Secondary Schools in Hainan Province. |
| Licence: | info:eu-repo/semantics/openAccess |
| Appears in Collections: | Department of Informatics and Computer Systems (Кафедра информатики и компьютерных систем). Articles |
Files in This Item:
| File | Description | Size | Format |
|---|---|---|---|
| electronics-14-00885.pdf | | 1.11 MB | Adobe PDF |
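
The abstract above names two mechanisms: treating each modality's features as a separate domain while separating domain-specific from cross-domain-invariant features, and running a graph neural network over the dialogue to combine global and local context. The snippet below is a minimal, hypothetical PyTorch sketch of how such a pipeline could be wired up, assuming a gradient-reversal domain classifier for the invariant features and a single mean-aggregation graph layer over utterance nodes; all module names, dimensions, and the additive fusion scheme are illustrative assumptions, not the authors' implementation described in the paper.

```python
# Hypothetical sketch (not the authors' implementation) of the two ideas named in
# the abstract: per-modality "domains" with shared (cross-domain-invariant) and
# private (domain-specific) features trained against a gradient-reversal domain
# classifier, followed by a simple graph layer over the utterances of a dialogue.
import torch
import torch.nn as nn


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass, sign-flipped gradient in the backward pass."""

    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lam * grad_out, None


class ModalityEncoder(nn.Module):
    """Shared and private projections for one modality (treated as a domain)."""

    def __init__(self, in_dim, hid_dim):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(in_dim, hid_dim), nn.ReLU())
        self.private = nn.Sequential(nn.Linear(in_dim, hid_dim), nn.ReLU())

    def forward(self, x):
        return self.shared(x), self.private(x)


class DialogueGraphLayer(nn.Module):
    """One round of mean-neighbour message passing over utterance nodes."""

    def __init__(self, dim):
        super().__init__()
        self.update = nn.Linear(2 * dim, dim)

    def forward(self, h, adj):
        # h: (num_utterances, dim); adj: (num_utterances, num_utterances) 0/1 matrix
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
        msg = adj @ h / deg                              # mean over neighbours
        return torch.relu(self.update(torch.cat([h, msg], dim=-1)))


class EmotionSketch(nn.Module):
    """Toy pipeline: encode modalities -> fuse -> dialogue graph -> classify."""

    def __init__(self, dims, hid_dim=128, num_classes=7, lam=1.0):
        super().__init__()
        self.encoders = nn.ModuleList(ModalityEncoder(d, hid_dim) for d in dims)
        self.domain_clf = nn.Linear(hid_dim, len(dims))  # predicts the source modality
        self.graph = DialogueGraphLayer(hid_dim)
        self.emotion_clf = nn.Linear(hid_dim, num_classes)
        self.lam = lam

    def forward(self, feats, adj):
        # feats: one (num_utterances, dim_m) tensor per modality
        shared, private, dom_logits = [], [], []
        for x, enc in zip(feats, self.encoders):
            s, p = enc(x)
            shared.append(s)
            private.append(p)
            # The domain classifier sees gradient-reversed shared features, which
            # pushes them to become indistinguishable across modalities.
            dom_logits.append(self.domain_clf(GradReverse.apply(s, self.lam)))
        fused = torch.stack(shared).mean(0) + torch.stack(private).mean(0)
        h = self.graph(fused, adj)                       # dialogue-level context
        return self.emotion_clf(h), torch.cat(dom_logits)


if __name__ == "__main__":
    n = 5                                                # utterances in one dialogue
    dims = [768, 100, 512]                               # assumed text/audio/visual sizes
    feats = [torch.randn(n, d) for d in dims]
    adj = torch.ones(n, n)                               # fully connected dialogue graph
    emo_logits, dom_logits = EmotionSketch(dims)(feats, adj)
    print(emo_logits.shape, dom_logits.shape)            # (5, 7) and (15, 3)
```

On a toy dialogue of five utterances this prints logits of shape (5, 7) from the seven-class emotion head and (15, 3) from the modality (domain) classifier; in training, a cross-entropy loss on the latter, combined with the gradient reversal, would encourage the shared features to generalize across modalities before fusion.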

