
Please use this identifier to cite or link to this item: https://elib.bsu.by/handle/123456789/337595
Title: Multimodal Emotion Recognition Method Based on Domain Generalization and Graph Neural Networks
Authors: Xie, J.
Wang, Y.
Meng, T.
Tai, J.
Zheng, Y.
Varatnitski, Yu.I.
Subject: ЭБ БГУ::Technical and applied sciences. Sectors of the economy::Automation. Computer engineering
Publication date: 2025
Publisher: MDPI
Bibliographic description: Electronics. 2025; 14: 885
Abstract: In recent years, multimodal sentiment analysis has attracted increasing attention from researchers owing to the rapid development of human–computer interaction. Sentiment analysis is an important task for understanding dialogues. However, with the increase in multimodal data, the processing of individual modality features and the methods for multimodal feature fusion have become increasingly important research topics. Existing methods that handle the features of each modality separately are not well suited for subsequent multimodal fusion and often fail to capture sufficient global and local information. Therefore, this study proposes a novel multimodal sentiment analysis method based on domain generalization and graph neural networks. The main characteristic of this method is that it treats the features of each modality as domains. It extracts domain-specific and cross-domain-invariant features, thereby facilitating cross-domain generalization. Generalized features are more suitable for multimodal fusion. Graph neural networks were employed to extract global and local information from the dialogue to capture the emotional changes of the speakers. Specifically, global representations were captured by modeling cross-modal interactions at the dialogue level, whereas local information was typically inferred from temporal information or the emotional changes of the speakers. The method proposed in this study outperformed existing models on the IEMOCAP, CMU-MOSEI, and MELD datasets by 0.97%, 1.09% (for seven-class classification), and 0.65% in terms of weighted F1 score, respectively. This clearly demonstrates that the domain-generalized features proposed in this study are better suited for subsequent multimodal fusion, and that the model developed here is more effective at capturing both global and local information.
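The sketch below is purely illustrative and is not the authors' implementation from the article. It shows, under stated assumptions, the general shape of the two ideas named in the abstract: treating each modality's features as a "domain" and encouraging modality-invariant shared features with a gradient-reversal domain classifier, followed by a simple graph aggregation step over the utterances of a dialogue. All class names, layer sizes, and feature dimensions are assumptions made only for this example.

```python
# Illustrative sketch only, NOT the code of the cited article.
# Idea 1: per-modality "private" and "shared" encoders, with a gradient-reversal
#         domain classifier that pushes shared features to be modality-invariant.
# Idea 2: a simple mean-aggregation graph layer over dialogue utterances (nodes)
#         to mix in conversational context before emotion classification.
import torch
import torch.nn as nn


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass, reversed (scaled) gradient in the backward pass."""
    @staticmethod
    def forward(ctx, x, lamb):
        ctx.lamb = lamb
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lamb * grad_out, None


class ModalityAsDomainSketch(nn.Module):
    # dims: assumed text/audio/visual feature sizes; hidden and n_classes are also assumed.
    def __init__(self, dims=(100, 74, 35), hidden=128, n_classes=7):
        super().__init__()
        self.private = nn.ModuleList([nn.Linear(d, hidden) for d in dims])   # domain-specific
        self.shared = nn.ModuleList([nn.Linear(d, hidden) for d in dims])    # domain-invariant
        self.domain_clf = nn.Linear(hidden, len(dims))   # predicts which modality a shared feature came from
        self.fuse = nn.Linear(2 * hidden * len(dims), hidden)
        self.classify = nn.Linear(hidden, n_classes)

    def graph_layer(self, x, adj):
        # One mean-aggregation step over the dialogue graph (utterances are nodes).
        deg = adj.sum(dim=-1, keepdim=True).clamp(min=1)
        return torch.relu(adj @ x / deg + x)

    def forward(self, modalities, adj, lamb=1.0):
        # modalities: list of [n_utterances, dim_m] tensors; adj: [n_utterances, n_utterances].
        priv = [torch.relu(p(m)) for p, m in zip(self.private, modalities)]
        shar = [torch.relu(s(m)) for s, m in zip(self.shared, modalities)]
        # Reversed gradients make the shared encoders work against the domain classifier,
        # which is what encourages cross-modality (cross-domain) invariance.
        dom_logits = [self.domain_clf(GradReverse.apply(h, lamb)) for h in shar]
        fused = torch.relu(self.fuse(torch.cat(priv + shar, dim=-1)))
        fused = self.graph_layer(fused, adj)  # dialogue-level context
        return self.classify(fused), dom_logits
```

In training, the domain logits would feed a cross-entropy loss against modality labels while the emotion logits feed the main classification loss; how the article balances these objectives, builds the dialogue graph, and fuses modalities in detail is described in the full text linked below.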
URI: https://elib.bsu.by/handle/123456789/337595
DOI: 10.3390/electronics14050885
Scopus ID: 86000555087
Funding: This research was funded by the Education Department of Hainan Province (project number: Huky2022-19) and by the Key Project of Application Research on the National Smart Education Platform for Primary and Secondary Schools in Hainan Province.
License: info:eu-repo/semantics/openAccess
Appears in collections: Department of Informatics and Computer Systems. Articles

Full text of the document:
File: electronics-14-00885.pdf (1.11 MB, Adobe PDF)


