Speech emotion recognition (SER) has long been a popular yet challenging task with broad applications in areas such as social media communication and medical diagnostics. Because SER datasets tend to be small in volume yet high in complexity, effectively integrating and modeling audio data remains a significant challenge in this field. To address this, we propose a model architecture that combines a fine-tuned Wav2vec2.0 with Neural Controlled Differential Equations (NCDEs): first, we use a fine-tuned Wav2vec2.0 model to extract rich contextual features; we then model the resulting high-dimensional time series with an NCDE classifier, setting the vector field to an MLP and updating the model's hidden state by solving the controlled differential equation. We conducted speech emotion recognition experiments on the IEMOCAP dataset, where our model achieves a weighted accuracy (WA) of 73.37% and an unweighted accuracy (UA) of 74.18%. The model also converges quickly, reaching good accuracy after just one epoch of training, and it is highly stable, with standard deviations of 0.45% in WA and 0.39% in UA.
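For concreteness, the following is a minimal sketch of such a Wav2vec2.0-plus-NCDE pipeline in PyTorch, assuming the `transformers` and `torchcde` libraries; the hidden size, MLP width, four-class output, and all module names are illustrative assumptions, not the authors' implementation.

```python
# Sketch only: Wav2vec2.0 features fed to an NCDE classifier with an MLP vector
# field. Hyperparameters and class names are assumptions, not the paper's code.
import torch
import torch.nn as nn
import torchcde
from transformers import Wav2Vec2Model

class CDEFunc(nn.Module):
    """MLP vector field f_theta: maps hidden state z to a matrix that the
    solver multiplies by dX/dt when integrating the controlled ODE."""
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.input_dim = input_dim
        self.hidden_dim = hidden_dim
        self.net = nn.Sequential(
            nn.Linear(hidden_dim, 128),
            nn.ReLU(),
            nn.Linear(128, hidden_dim * input_dim),
            nn.Tanh(),
        )

    def forward(self, t, z):
        # Output shape (batch, hidden_dim, input_dim) as torchcde expects.
        return self.net(z).view(z.size(0), self.hidden_dim, self.input_dim)

class Wav2Vec2NCDE(nn.Module):
    def __init__(self, feat_dim=768, hidden_dim=64, num_classes=4):
        super().__init__()
        self.encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
        self.initial = nn.Linear(feat_dim + 1, hidden_dim)  # +1 time channel
        self.func = CDEFunc(feat_dim + 1, hidden_dim)
        self.readout = nn.Linear(hidden_dim, num_classes)

    def forward(self, waveform):  # waveform: (batch, num_samples)
        feats = self.encoder(waveform).last_hidden_state     # (B, T, feat_dim)
        t = torch.linspace(0, 1, feats.size(1), device=feats.device)
        t = t.expand(feats.size(0), -1).unsqueeze(-1)
        path = torch.cat([t, feats], dim=-1)                 # time-augmented path
        # Interpolate the feature sequence into a continuous control path X(t).
        coeffs = torchcde.hermite_cubic_coefficients_with_backward_differences(path)
        X = torchcde.CubicSpline(coeffs)
        z0 = self.initial(X.evaluate(X.interval[0]))
        # Solve dz = f_theta(z) dX over the full interval; keep the final state.
        zT = torchcde.cdeint(X=X, func=self.func, z0=z0, t=X.interval)[:, -1]
        return self.readout(zT)                              # emotion logits
```

The time channel concatenated onto the features is the standard time-augmentation used with NCDEs, which makes the learned dynamics depend on position in the utterance as well as on the Wav2vec2.0 features.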