PAD-Guided Multimodal Hybrid Contrastive Emotion Recognition upon STEM-E2VA Dataset

Duan, S; Zhang, W; Li, L; Zhu, T; Zhao, F; Li, F; Liang, H

doi:10.3390/mti10040038

NU author(s): Ting Zhu, Dr Huizhi Liang

Licence

This work is licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0).

Abstract

© 2026 by the authors. Speech emotion recognition still faces three challenges: single-modal information has limited representational capability, discrete emotion annotations struggle to capture continuous emotional transitions, and multimodal fusion methods continue to suffer from structural differences between modalities and from cross-sample alignment problems. To address these, this study works from both the data and the model perspectives. On the data side, a Chinese multimodal database, STEM-E2VA, was constructed by synchronously collecting four modalities: articulatory kinematics, acoustics, glottal signals, and video. It covers seven discrete emotion categories and additionally employs continuous PAD (Pleasure-Arousal-Dominance) annotation. By integrating discrete and continuous dimensional annotations, it better represents the distinction between strong and weak emotions under the same discrete emotion label. To handle biases in the PAD annotations, the SCL-90 psychological questionnaire was used to analyze annotators' cognitive and emotional perceptions, with the aim of ensuring data reliability. On the model side, this paper proposes a PAD-aware multimodal supervised contrastive fusion network. It employs a PAD-enhanced hybrid contrastive loss function to optimize intra-modal and inter-modal feature alignment. Using a cross-attention mechanism combined with a GRU–Transformer network for temporal feature extraction, it achieves deep fusion of multimodal information, reducing inter-modal discrepancies and cross-class confusion. Experiments show that the proposed method achieves 85.47% accuracy in discrete emotion recognition on STEM-E2VA, with a substantial reduction in RMSE for PAD dimension prediction. It also generalizes well to IEMOCAP, providing a novel framework for integrating discrete and continuous emotion representations.
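
The abstract does not give the form of the PAD-enhanced hybrid contrastive loss, so the following is only a rough illustration of the general idea, not the authors' formulation. It sketches a standard supervised contrastive loss in which positive pairs (same discrete emotion label) are additionally weighted by how close their PAD annotations are. The function name `pad_weighted_supcon_loss`, the exponential weighting, and the parameters `temperature` and `alpha` are all assumptions made for this example.

```python
import math


def _normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def pad_weighted_supcon_loss(embeddings, labels, pad, temperature=0.1, alpha=1.0):
    """Supervised contrastive loss with PAD-distance weights on positives.

    ASSUMPTION: weighting positives by exp(-alpha * PAD distance) is a
    hypothetical choice for illustration; the paper's loss may differ.
    """
    z = [_normalize(e) for e in embeddings]
    n = len(z)
    losses = []
    for i in range(n):
        others = [k for k in range(n) if k != i]
        positives = [j for j in others if labels[j] == labels[i]]
        if not positives:
            continue  # an anchor without positives contributes nothing
        logits = {k: _dot(z[i], z[k]) / temperature for k in others}
        # Log of the softmax denominator, computed stably (log-sum-exp).
        m = max(logits.values())
        log_denom = m + math.log(sum(math.exp(v - m) for v in logits.values()))
        # Positives with similar PAD values receive larger weights.
        weights = [math.exp(-alpha * math.dist(pad[i], pad[j])) for j in positives]
        total = sum(weights)
        loss_i = -sum(
            (w / total) * (logits[j] - log_denom)
            for w, j in zip(weights, positives)
        )
        losses.append(loss_i)
    return sum(losses) / len(losses) if losses else 0.0
```

Under these assumptions, embeddings that cluster by emotion label yield a loss near zero, while embeddings that mix labels yield a larger one. Setting `alpha=0` removes the PAD weighting and reduces the sketch to a plain supervised contrastive loss.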

Publication metadata

Author(s): Duan S, Zhang W, Li L, Zhu T, Zhao F, Li F, Liang H

Publication type: Article

Publication status: Published

Journal: Multimodal Technologies and Interaction

Year: 2026

Volume: 10

Issue: 4

Online publication date: 02/04/2026

Acceptance date: 31/03/2026

Date deposited: 06/05/2026

ISSN (electronic): 2414-4088

Publisher: MDPI

URL: https://doi.org/10.3390/mti10040038

DOI: 10.3390/mti10040038

Data Access Statement: The data presented in this study are available on request from the corresponding author. The data are not publicly available due to privacy and ethical restrictions, as they contain sensitive multimodal information that could compromise the privacy of the participants.

Funding

Funder reference	Funder name
202403021211098	Natural Science Foundation of Shanxi Province, China
2024-060	Shanxi Scholarship Council of China
12004275	Youth Fund of the National Natural Science Foundation of China
