i.e., VGGish) is fine-tuned on the target emotional datasets to learn segment-level speech features from the extracted log Mel-spectrograms. Next, a unified first-order attention mechanism, which incorporates different feature-pooling strategies such as sum, min, max, mean, and standard deviation (std), is embedded into the output of a bi-directional long short-term memory (Bi-LSTM) network. It learns high-level discriminative segment-level features and simultaneously aggregates them into fixed-length utterance-level features for SER. Finally, based on the utterance-level features, a softmax layer on top of the Bi-LSTM network performs the final emotion classification. Extensive experiments on three public datasets, namely BAUM-1s, AFEW5.0, and CHEAVD2.0, demonstrate the advantage of the proposed method.
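The aggregation step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the attention vector `w`, the dimensions, and the exact combination of the pooled statistics are assumptions; it shows how segment-level Bi-LSTM outputs might be turned into a single fixed-length utterance-level vector by combining attention weighting with sum/min/max/mean/std pooling.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def attentive_stats_pool(H, w):
    """Aggregate segment-level features H (T x D) into one fixed-length
    utterance-level vector. Hypothetical first-order attention pooling:
    attention-weighted sum concatenated with min/max/mean/std statistics."""
    alpha = softmax(H @ w)             # (T,) attention weight per segment
    weighted_sum = (alpha[:, None] * H).sum(axis=0)  # attention-weighted sum
    return np.concatenate([
        weighted_sum,                  # sum pooling (attention-weighted)
        H.min(axis=0),                 # min pooling
        H.max(axis=0),                 # max pooling
        H.mean(axis=0),                # mean pooling
        H.std(axis=0),                 # std pooling
    ])                                 # (5 * D,) utterance-level vector

rng = np.random.default_rng(0)
H = rng.normal(size=(20, 8))           # 20 segments, 8-dim Bi-LSTM outputs (toy sizes)
w = rng.normal(size=8)                 # hypothetical learned attention vector
u = attentive_stats_pool(H, w)
print(u.shape)                         # (40,)
```

In this sketch the utterance-level dimensionality is fixed at five times the segment feature size regardless of the number of segments, which is what makes the representation suitable for a downstream softmax classifier.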

Speech Emotion Recognition by Combining a Unified First-Order Attention Network With Data Balance
