
Developing a benchmark for emotional analysis of music

Anna Aljanaki et al. PLoS One. 2017.

Abstract

The field of music emotion recognition (MER) has expanded rapidly over the last decade. Many new methods and audio features have been developed to improve the performance of MER algorithms. However, comparing the performance of new methods is difficult because of the diversity of data representations and the scarcity of publicly available data. In this paper, we address these problems by creating a data set and a benchmark for MER. The data set that we release, the MediaEval Database for Emotional Analysis in Music (DEAM), is the largest available data set of dynamic annotations (valence and arousal annotations for 1,802 songs and song excerpts licensed under Creative Commons, with 2 Hz time resolution). Using DEAM, we organized the 'Emotion in Music' task at the MediaEval Multimedia Evaluation Campaign from 2013 to 2015. In total, the benchmark attracted 21 active teams to the challenge. We analyze the results of the benchmark: the winning algorithms and feature sets. We also describe the design of the benchmark, the evaluation procedures, and the data cleaning and transformations that we suggest. The results of the benchmark suggest that recurrent-neural-network-based approaches combined with large feature sets work best for dynamic MER.
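To make the conclusion concrete, the sketch below illustrates what an RNN-based dynamic MER regressor of the kind favored by the benchmark might look like. It is not the code of any benchmark submission; it assumes PyTorch, a placeholder feature dimensionality of 260, and hypothetical 45-second excerpts annotated at 2 Hz (90 frames).

```python
# Minimal sketch (not benchmark code): a GRU regressor mapping frame-level
# audio features to per-frame valence/arousal, in the spirit of the
# RNN-based approaches the benchmark found to work best.
# Assumptions: 260 is a placeholder feature dimensionality; 90 frames
# stands in for a 45-second excerpt annotated at 2 Hz.
import torch
import torch.nn as nn

class DynamicMERNet(nn.Module):
    def __init__(self, n_features=260, hidden_size=64):
        super().__init__()
        self.rnn = nn.GRU(n_features, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, 2)   # valence and arousal

    def forward(self, x):             # x: (batch, time, n_features)
        out, _ = self.rnn(x)          # (batch, time, hidden_size)
        return self.head(out)         # (batch, time, 2)

model = DynamicMERNet()
features = torch.randn(8, 90, 260)    # batch of dummy feature sequences
targets = torch.randn(8, 90, 2)       # dummy dynamic valence/arousal labels
loss = nn.MSELoss()(model(features), targets)
loss.backward()                       # an optimizer step would follow
```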


Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures


Fig 1. A labyrinth of data representation choices for a MER algorithm.

The choices that we made for the benchmark are highlighted in red.


Fig 2. Annotation interface for both continuous (upper-left corner) and static per-song (middle; using the self-assessment manikins [43]) ratings of arousal.


Fig 3. Fitted GAMs for the arousal and valence annotations of two songs.
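As an illustration only, the sketch below fits a generalized additive model (GAM) with a single smooth term over time to a synthetic arousal trace, analogous to the per-song fits shown in Fig 3. It assumes the third-party pygam package; the annotation values are made up, not DEAM data.

```python
# Illustrative sketch: fit a GAM with one smooth term (time) to a
# dynamic annotation trace. Assumes the third-party pygam package;
# the arousal values below are synthetic.
import numpy as np
from pygam import LinearGAM, s

t = np.arange(0, 30, 0.5)                        # 2 Hz time stamps (seconds)
arousal = 0.3 * np.sin(t / 4.0) + 0.05 * np.random.randn(t.size)

gam = LinearGAM(s(0, n_splines=20)).fit(t.reshape(-1, 1), arousal)
smoothed = gam.predict(t.reshape(-1, 1))         # fitted curve, as in Fig 3
```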

Fig 4. Liking of the music and confidence in rating for a) valence, Spearman’s ρ = 0.37, p-value = 2.2 × 10⁻¹⁶; b) arousal, Spearman’s ρ = 0.29, p-value = 2.2 × 10⁻¹⁶.
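The values in Fig 4 are Spearman rank correlations between per-song liking and rating confidence. A correlation of this kind can be computed with scipy as in the hypothetical example below; the ratings are synthetic, not the DEAM annotations.

```python
# Hypothetical example: Spearman correlation between liking and
# confidence ratings (synthetic data).
import numpy as np
from scipy.stats import spearmanr

liking = np.array([1, 3, 4, 2, 5, 4, 3, 2, 5, 1])
confidence = np.array([2, 3, 5, 2, 4, 4, 3, 1, 5, 2])
rho, p_value = spearmanr(liking, confidence)
print(f"rho = {rho:.2f}, p = {p_value:.3g}")
```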


Fig 5. Krippendorff’s α of dynamic annotations in 2015, averaged over all dynamic samples.
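Fig 5 reports Krippendorff’s α as the inter-rater agreement measure for the dynamic annotations. The sketch below computes α for interval-level data (squared-difference distance), following the standard formulation; the rater-by-frame matrix is synthetic, and NaN marks missing ratings.

```python
# Sketch of Krippendorff's alpha for interval data (squared differences),
# the agreement measure reported in Fig 5. Rows are raters, columns are
# time frames, NaN = missing; the matrix below is synthetic.
import numpy as np

def krippendorff_alpha_interval(ratings):
    # Keep only units (columns) with at least two ratings.
    units = []
    for u in range(ratings.shape[1]):
        vals = ratings[:, u]
        vals = vals[~np.isnan(vals)]
        if vals.size >= 2:
            units.append(vals)
    n = sum(v.size for v in units)             # total pairable values

    # Observed disagreement: within-unit squared differences.
    d_obs = 0.0
    for vals in units:
        diffs = (vals[:, None] - vals[None, :]) ** 2
        d_obs += diffs.sum() / (vals.size - 1)
    d_obs /= n

    # Expected disagreement: squared differences across all pooled values.
    pooled = np.concatenate(units)
    d_exp = ((pooled[:, None] - pooled[None, :]) ** 2).sum() / (n * (n - 1))
    return 1.0 - d_obs / d_exp

ratings = np.array([[0.2, 0.3, np.nan, 0.5],
                    [0.1, 0.4, 0.4,    0.6],
                    [0.3, 0.2, 0.5,    np.nan]])
print(krippendorff_alpha_interval(ratings))
```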

Fig 6. Distribution of the labels on the arousal-valence plane for a) the development set; b) the evaluation set.



References

    1. Inskip C, Macfarlane A, Rafferty P. Towards the disintermediation of creative music search: analysing queries to determine important facets. International Journal on Digital Libraries. 2012;12(2):137–147. doi:10.1007/s00799-012-0084-1
    2. Yang YH, Chen HH. Machine recognition of music emotion: A review. ACM Transactions on Intelligent Systems and Technology. 2012;3(3). doi:10.1145/2168752.2168754
    3. Kim YE, Schmidt EM, Migneco R, Morton BG, Richardson P, Scott J, et al. Music emotion recognition: A state of the art review. In: Proceedings of the International Society for Music Information Retrieval Conference; 2010. p. 255–266.
    4. Laurier C, Lartillot O, Eerola T, Toiviainen P. Exploring relationships between audio features and emotion in music. In: Proceedings of the Triennial Conference of the European Society for the Cognitive Sciences of Music; 2009. p. 260–264.
    5. Yang YH, Lin YC, Su YF, Chen HH. A regression approach to music emotion recognition. IEEE Transactions on Audio, Speech, and Language Processing. 2008;16(2):448–457. doi:10.1109/TASL.2007.911513


Grants and funding

AA’s work was supported by the COMMIT/ project (commit-nl.nl). MS was supported by the Ambizione program of the Swiss National Science Foundation (grant number PZ00P2_154981).
