Lookup NU author(s): Ting Zhu, Dr Huizhi Liang
This work is licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0).
© 2026 by the authors.
Speech emotion recognition still faces several challenges: single-modal information has limited representational capability, discrete emotion annotations struggle to capture continuous emotional transitions, and multimodal fusion methods continue to suffer from structural differences between modalities and from cross-sample alignment problems. To address these, this study works from both the data and the model perspective. On the data side, a Chinese multimodal database, STEM-E2VA, was constructed by synchronously collecting four modalities: articulatory kinematics, acoustics, glottal signals, and video. It covers seven discrete emotion categories and also carries continuous PAD (pleasure, arousal, dominance) annotation. By integrating discrete and continuous dimensional annotations, it better represents the distinction between strong and weak emotions under the same discrete emotion label. To address bias in the PAD annotations, the SCL-90 psychological questionnaire was used to analyze annotators' cognitive and emotional perceptions, thereby ensuring data reliability. On the model side, this paper proposes a multimodal supervised contrastive fusion network incorporating PAD perception. It employs a PAD-enhanced hybrid contrastive loss function to optimize intra-modal and inter-modal feature alignment. Using a cross-attention mechanism combined with a GRU–Transformer network for temporal feature extraction, it achieves deep fusion of multimodal information, reducing inter-modal discrepancies and cross-class confusion. Experiments show that the proposed method achieves 85.47% accuracy in discrete sentiment recognition on STEM-E2VA, with a substantial reduction in RMSE for PAD dimension prediction. It also generalizes well to IEMOCAP, providing a novel framework for integrating discrete and continuous sentiment representations.
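For readers unfamiliar with supervised contrastive learning, the sketch below shows the generic supervised contrastive loss on which such networks typically build. It is not the paper's PAD-enhanced hybrid loss, whose formulation the abstract does not give. The function names, the temperature value, and the toy embeddings are all illustrative assumptions, and embeddings are assumed to be non-zero vectors.

```python
import math

def l2_normalize(v):
    # Scale a non-zero vector to unit length so dot products are cosine similarities.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    # Generic supervised contrastive loss (illustrative, not the paper's loss).
    # For each anchor, samples sharing its label are positives; every other
    # sample in the batch appears in the softmax denominator.
    z = [l2_normalize(v) for v in embeddings]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # an anchor with no positive contributes nothing
        sims = {
            j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
            for j in range(n) if j != i
        }
        # Log-sum-exp with the maximum subtracted for numerical stability.
        m = max(sims.values())
        log_denom = m + math.log(sum(math.exp(s - m) for s in sims.values()))
        total += -sum(sims[p] - log_denom for p in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0

# Toy batch: two tight clusters in 2-D (hypothetical values).
emb = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(supervised_contrastive_loss(emb, [0, 0, 1, 1]))  # labels match the clusters
print(supervised_contrastive_loss(emb, [0, 1, 0, 1]))  # labels cut across the clusters
```

The loss is small when same-label embeddings lie close together and large when they are scattered, which is the property that lets it pull together features of the same emotion class, whether they come from one modality or from several.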
Author(s): Duan S, Zhang W, Li L, Zhu T, Zhao F, Li F, Liang H
Publication type: Article
Publication status: Published
Journal: Multimodal Technologies and Interaction
Year: 2026
Volume: 10
Issue: 4
Online publication date: 02/04/2026
Acceptance date: 31/03/2026
Date deposited: 06/05/2026
ISSN (electronic): 2414-4088
Publisher: MDPI
URL: https://doi.org/10.3390/mti10040038
DOI: 10.3390/mti10040038
Data Access Statement: The data presented in this study are available on request from the corresponding author. The data are not publicly available due to privacy and ethical restrictions, as they contain sensitive multimodal information that could compromise the privacy of the participants.