SubSpectralNet – Using Sub-spectrogram Based Convolutional Neural Networks for Acoustic Scene Classification

A Convolutional Neural Network Approach for Acoustic Scene Classification

This paper presents a novel application of convolutional neural networks (CNNs) to the task of acoustic scene classification (ASC). We propose the use of a CNN trained to classify short sequences of audio, represented by their log-mel spectrograms. We also introduce a training method that can be used under particular circumstances to make full use of small datasets. The proposed system is tested and evaluated on three different ASC datasets and compared to other state-of-the-art systems that competed in the "Detection and Classification of Acoustic Scenes and Events" (DCASE) challenges held in 2016 and 2013. The best accuracy scores obtained by our system on the DCASE 2016 datasets are 79.0% (development) and 86.2% (evaluation), improvements of 6.4% and 9% over the baseline system. Finally, when tested on the DCASE 2013 evaluation dataset, the proposed system reaches 77.0% accuracy, improving on the challenge winner's score by 1%.
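The front end described here, a log-mel spectrogram fed to a small convolutional stack, can be sketched as follows. This is a minimal illustration assuming librosa and PyTorch, with illustrative hyperparameters (40 mel bands, two conv/pool stages); it is not the authors' exact architecture.

```python
import librosa
import torch
import torch.nn as nn

def log_mel(path, sr=22050, n_mels=40):
    """Load a clip and return its log-mel spectrogram as (1, n_mels, frames)."""
    y, sr = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048,
                                         hop_length=512, n_mels=n_mels)
    return torch.from_numpy(librosa.power_to_db(mel)).unsqueeze(0).float()

class SceneCNN(nn.Module):
    """Minimal conv/pool stack ending in logits over scene classes."""
    def __init__(self, n_classes=15):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1),          # pool to a fixed-size vector
        )
        self.classifier = nn.Linear(64, n_classes)

    def forward(self, x):                     # x: (batch, 1, mels, frames)
        return self.classifier(self.features(x).flatten(1))
```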

Sound Context Classification based on Joint Learning Model and Multi-Spectrogram Features

International Journal of Computing

This article presents a deep learning framework for Acoustic Scene Classification (ASC), the task of classifying different environments from the sounds they produce. To develop the framework, we first carry out a comprehensive analysis of spectrogram representations extracted from sound scene input, then propose the best multi-spectrogram combination for front-end feature extraction. For back-end classification, we propose a novel joint learning model using a parallel architecture of a Convolutional Neural Network (CNN) and a Convolutional Recurrent Neural Network (C-RNN), which efficiently learns both the spatial features and the temporal sequences of a spectrogram input. The experimental results prove our proposed framework to be general and robust for ASC tasks through three main contributions. Firstly, the most effective spectrogram combination is identified for specific datasets that no previous publication has analyzed. Secondly, our joint learning arc...
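The multi-spectrogram front end, several time-frequency views of the same clip combined before classification, might look roughly like the sketch below. It assumes librosa; the paper's actual spectrogram types and combination rule are not given in this excerpt, so mel and CQT views stacked as channels stand in here.

```python
import numpy as np
import librosa

def multi_spectrogram(y, sr, hop=512, bins=84):
    """Stack two time-frequency views of one clip as image channels."""
    mel_db = librosa.power_to_db(
        librosa.feature.melspectrogram(y=y, sr=sr, hop_length=hop,
                                       n_mels=bins))
    cqt_db = librosa.amplitude_to_db(
        np.abs(librosa.cqt(y, sr=sr, hop_length=hop, n_bins=bins)))
    # Both transforms are centred with the same hop, so frame counts match;
    # per-channel normalisation keeps the two views on a comparable scale.
    stack = np.stack([mel_db, cqt_db])
    mean = stack.mean(axis=(1, 2), keepdims=True)
    std = stack.std(axis=(1, 2), keepdims=True)
    return (stack - mean) / (std + 1e-8)      # shape: (2, bins, frames)
```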

Acoustic Scene Classification Using a CNN-SuperVector System Trained with Auditory and Spectrogram Image Features

Interspeech 2017, 2017

Enabling smart devices to infer about the environment using audio signals has been one of the several long-standing challenges in machine listening. The availability of public-domain datasets, e.g., Detection and Classification of Acoustic Scenes and Events (DCASE) 2016, enabled researchers to compare various algorithms on standard predefined tasks. Most of the current best performing individual acoustic scene classification systems utilize different spectrogram image based features with a Convolutional Neural Network (CNN) architecture. In this study, we first analyze the performance of a state-of-the-art CNN system for different auditory image and spectrogram features, including Mel-scaled, logarithmically scaled, linearly scaled filterbank spectrograms, and Stabilized Auditory Image (SAI) features. Next, we benchmark an MFCC based Gaussian Mixture Model (GMM) SuperVector (SV) system for acoustic scene classification. Finally, we utilize the activations from the final layer of the CNN to form a SuperVector (SV) and use them as feature vectors for a Probabilistic Linear Discriminant Analysis (PLDA) classifier. Experimental evaluation on the DCASE 2016 database demonstrates the effectiveness of the proposed CNN-SV approach compared to conventional CNNs with a fully connected softmax output layer. Score fusion of individual systems provides up to 7% relative improvement in overall accuracy compared to the CNN baseline system.
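The SuperVector step, collecting the activations of one CNN layer as fixed-length vectors for an external classifier, could be prototyped as below. This is a sketch assuming PyTorch; extract_supervectors is a hypothetical helper, and the PLDA back end itself (not shown) would come from an external implementation.

```python
import torch

def extract_supervectors(model, loader, layer):
    """Collect one layer's activations over a dataset as 'supervector'
    embeddings, to be modelled by an external classifier such as PLDA."""
    feats = []
    # A forward hook captures the chosen layer's output on every forward pass.
    handle = layer.register_forward_hook(
        lambda mod, inp, out: feats.append(out.flatten(1).cpu()))
    model.eval()
    with torch.no_grad():
        for batch, _ in loader:
            model(batch)                      # hook fires here
    handle.remove()
    return torch.cat(feats)                   # (n_clips, dim) matrix
```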

Cross-domain synergy: Leveraging image processing techniques for enhanced sound classification through spectrogram analysis using CNNs

Journal of Autonomous Intelligence

This paper reviews an innovative approach to sound classification that exploits the potential of image processing techniques applied to spectrogram representations of audio signals. This study shows the effectiveness of incorporating well-established image processing methodologies, such as filtering, segmentation, and pattern recognition, to enhance the feature extraction and classification performance of audio signals when transformed into spectrograms. An overview is provided of the mathematical methods shared by both image and spectrogram-based audio processing, focusing on the commonalities between the two domains in terms of the underlying principles, techniques, and algorithms. The proposed methodology leverages in particular the power of convolutional neural networks (CNNs) to extract and classify time-frequency features from spectrograms, capitalizing on the advantages of their hierarchical feature learning and robustness to translation and scale variations. Other d...
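As a rough illustration of applying classic image operations (filtering, segmentation, edge detection) to a spectrogram treated as an image, consider the following SciPy sketch; the 3x3 median filter and the +6 dB threshold are arbitrary illustrative choices, not values from the paper.

```python
import numpy as np
from scipy import ndimage

def enhance_spectrogram(spec_db):
    """Apply classic image operations to a (freq, time) dB spectrogram."""
    smoothed = ndimage.median_filter(spec_db, size=3)     # denoise
    # Simple segmentation: keep time-frequency cells well above the
    # per-frequency background level (a crude foreground/noise separation).
    background = np.median(smoothed, axis=1, keepdims=True)
    mask = smoothed > background + 6.0                    # +6 dB threshold
    edges = ndimage.sobel(smoothed, axis=0)               # spectral edges
    return smoothed, mask, edges
```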

Acoustic Scene Analysis and Classification Using Densenet Convolutional Neural Network

In this paper we present an account of the state of the art in Acoustic Scene Classification (ASC), the task of classifying environmental scenarios through the sounds they produce. Our work aims to classify 50 different outdoor and indoor scenarios using environmental sounds. We use the ESC-50 dataset from the IEEE challenge on Detection and Classification of Acoustic Scenes and Events (DCASE), comprising 2000 different environmental audio recordings. In this method the raw audio data is converted into a Mel-spectrogram and other characteristics such as Tonnetz, Chroma and MFCC. The generated Mel-spectrogram is fed as input to a neural network for training. Our model follows a convolutional neural network structure built from convolution and pooling layers. With a focus on real-time environmental classification and to overcome the problem of low generalization in the model, the paper introduces augmentation, producing noise-modified audio by adding Gaussian white noise. Active researche...
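The Gaussian white-noise augmentation mentioned above is straightforward to reproduce. A minimal NumPy sketch that adds noise at a chosen signal-to-noise ratio (the abstract does not state the SNR used, so it is a parameter here):

```python
import numpy as np

def add_white_noise(y, snr_db=20.0, rng=None):
    """Return a copy of waveform y with Gaussian white noise at a target SNR."""
    rng = rng or np.random.default_rng()
    signal_power = np.mean(y ** 2)
    # SNR(dB) = 10 * log10(signal_power / noise_power), solved for noise_power.
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=y.shape)
    return (y + noise).astype(y.dtype)
```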

Robust acoustic scene classification using a multi-spectrogram encoder-decoder framework

Digital Signal Processing, 2021

This article proposes an encoder-decoder network model for Acoustic Scene Classification (ASC), the task of identifying the scene of an audio recording from its acoustic signature. We make use of multiple low-level spectrogram features at the front-end, transformed into higher level features through a well-trained CNN-DNN front-end encoder. The high level features and their combination (via a trained feature combiner) are then fed into different decoder models comprising random forest regression, DNNs and a mixture of experts, for back-end classification. We report extensive experiments to evaluate the accuracy of this framework for various ASC datasets, including LITIS Rouen and IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE) 2016 Task 1, 2017 Task 1, 2018 Tasks 1A & 1B and 2019 Tasks 1A & 1B. The experimental results highlight two main contributions; the first is an effective method for high-level feature extraction from multi-spectrogram input via the novel C-DNN architecture encoder network, and the second is the proposed decoder which enables the framework to achieve competitive results on various datasets. The fact that a single framework is highly competitive for several different challenges is an indicator of its robustness for performing general ASC tasks.
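The trained feature combiner between encoder and decoders is not specified in detail here; one plausible minimal reading, a learned softmax weighting over per-spectrogram embeddings ahead of a small decoder, is sketched below in PyTorch. Treat it as an assumption, not the paper's architecture.

```python
import torch
import torch.nn as nn

class FeatureCombiner(nn.Module):
    """Merge per-spectrogram embeddings with learned weights, then decode."""
    def __init__(self, dim, n_views=3, n_classes=10):
        super().__init__()
        self.weights = nn.Parameter(torch.ones(n_views))   # one gate per view
        self.decoder = nn.Sequential(nn.Linear(dim, 256), nn.ReLU(),
                                     nn.Linear(256, n_classes))

    def forward(self, views):                 # views: (batch, n_views, dim)
        w = torch.softmax(self.weights, dim=0)
        combined = (views * w[None, :, None]).sum(dim=1)   # weighted sum
        return self.decoder(combined)
```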

Audio Recognition using Mel Spectrograms and Convolution Neural Networks

2019

Automatic sound recognition has received heightened research interest in recent years due to its many potential applications. These include automatic labeling of video/audio content and real-time sound detection for robotics. While image classification is a heavily researched topic, sound identification is less mature. In this study, we take advantage of the robust machine learning techniques developed for image classification and apply them to the sound recognition problem. Raw audio data from the Freesound Dataset (FSD) provided by Kaggle is first converted to a spectrogram representation in order to apply these image classification techniques. We test and compare two approaches using deep convolutional neural networks (CNNs): 1.) our own CNN architecture; 2.) transfer learning using the pre-trained VGG19 network. Using our self-developed architecture, we achieve a label-weighted label-ranking average precision (LWLRAP) score and top-5 accuracy of 0.813 and 88.9%, respectively, whe...
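The VGG19 transfer-learning setup can be sketched with torchvision as below. Freezing the convolutional backbone and swapping the final layer is one common recipe; the details (which layers are frozen, how the single-channel input is handled) are assumptions rather than the authors' configuration.

```python
import torch.nn as nn
from torchvision.models import vgg19, VGG19_Weights

def build_transfer_model(n_classes):
    """VGG19 pretrained on ImageNet, head swapped for sound-class logits."""
    model = vgg19(weights=VGG19_Weights.DEFAULT)
    for p in model.features.parameters():
        p.requires_grad = False              # freeze convolutional backbone
    model.classifier[6] = nn.Linear(4096, n_classes)   # new output head
    return model

# Spectrograms are single-channel, so tile them to three channels,
# e.g. x.repeat(1, 3, 1, 1), before feeding the ImageNet backbone.
```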

Sound Context Classification Basing on Join Learning Model and Multi-Spectrogram Features

Cornell University - arXiv, 2020

In this paper, we present a deep learning framework for Acoustic Scene Classification (ASC), the task of classifying scene contexts from environmental input sounds. An ASC system generally comprises two main steps, referred to as front-end feature extraction and back-end classification. In the first step, an extractor is used to derive low-level features from the raw audio signal. Next, the discriminative features extracted are fed into a classifier, which reports classification accuracy. Aiming to develop a robust framework for ASC, we address existing issues in both the front-end and back-end components of an ASC system, and present three main contributions. Firstly, we carry out a comprehensive analysis of spectrogram representations extracted from sound scene input and propose the best multi-spectrogram combinations. In terms of back-end classification, we propose a novel joint learning architecture using parallel convolutional recurrent networks, which is effective at learning the spatial features and temporal sequences of spectrogram input. Finally, good experimental results obtained over the benchmark datasets of the IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE) 2016 Task 1, 2017 Task 1, 2018 Tasks 1A & 1B, and LITIS Rouen prove that our proposed framework is general and robust for the ASC task.
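A minimal sketch of the parallel CNN/recurrent joint architecture described above, assuming PyTorch. The branch sizes and the GRU-based recurrent path are illustrative stand-ins; the paper's exact C-RNN configuration is not given in this excerpt.

```python
import torch
import torch.nn as nn

class ParallelCNNCRNN(nn.Module):
    """Two branches over one spectrogram: a CNN for spatial patterns and a
    conv + GRU branch for temporal structure, fused before the classifier."""
    def __init__(self, n_mels=128, n_classes=10):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1))                       # -> (B, 32, 1, 1)
        self.conv1d = nn.Conv1d(n_mels, 64, 3, padding=1)  # frames as sequence
        self.gru = nn.GRU(64, 64, batch_first=True)
        self.classifier = nn.Linear(32 + 64, n_classes)

    def forward(self, x):                     # x: (B, 1, n_mels, frames)
        spatial = self.cnn(x).flatten(1)                   # (B, 32)
        seq = self.conv1d(x.squeeze(1))                    # (B, 64, frames)
        _, h = self.gru(seq.transpose(1, 2))               # h: (1, B, 64)
        return self.classifier(torch.cat([spatial, h[-1]], dim=1))
```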

Spectral images based environmental sound classification using CNN with meaningful data augmentation

Applied Acoustics, 2020

In this study, an effective approach to environmental sound classification based on spectral images, using Convolutional Neural Networks (CNN) with meaningful data augmentation, is proposed. The feature used in this approach is the Mel spectrogram: features are derived from audio clips in the form of spectrogram images. The CNN models used in this experiment are a 7-layer or a 9-layer CNN learned from scratch. In addition, various well-known deep learning structures are considered with transfer learning, following the scheme of freezing the initial layers, training the model, unfreezing the layers, and training the model again with discriminative learning. Three datasets, ESC-10, ESC-50, and Us8k, are considered. For the transfer learning methodology, 11 pre-trained deep learning structures are used. In this study, instead of using the data augmentation schemes available for images, we propose meaningful data augmentation obtained by applying variations to the audio clips directly. The results show the effectiveness, robustness, and high accuracy of the proposed approach. The meaningful data augmentation achieves the highest accuracy with a lower error rate on all datasets when using transfer learning models. Among the models used, ResNet-152 attained 99.04% for the ESC-10 and 99.49% for the Us8k dataset, while DenseNet-161 reached 97.57% for ESC-50. To our knowledge, these are the best results achieved on these datasets.
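The "meaningful" augmentation, varying the clip itself rather than the finished spectrogram image, might look like the following librosa sketch; the specific transforms and ranges here (time stretch, pitch shift, random offset) are assumptions, since the abstract does not enumerate the variations used.

```python
import numpy as np
import librosa

def augment_clip(y, sr, rng=None):
    """Audio-domain augmentations applied before spectrogram extraction,
    instead of image-style transforms on the finished spectrogram."""
    rng = rng or np.random.default_rng()
    y = librosa.effects.time_stretch(y, rate=rng.uniform(0.9, 1.1))
    y = librosa.effects.pitch_shift(y, sr=sr, n_steps=rng.uniform(-2, 2))
    shift = int(rng.integers(0, sr // 10))    # random start offset (<0.1 s)
    return np.roll(y, shift)
```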

A Low-Complexity Deep Learning Framework for Acoustic Scene Classification

2021

In this paper, we present a low-complexity deep learning framework for acoustic scene classification (ASC). The proposed framework can be separated into three main steps: front-end spectrogram extraction, back-end classification, and late fusion of predicted probabilities. First, we use a Mel filter, a Gammatone filter and the Constant Q Transform (CQT) to transform the raw audio signal into spectrograms, where both frequency and temporal features are presented. The three spectrograms are then fed into three individual back-end convolutional neural networks (CNNs), classifying into ten urban scenes. Finally, a late fusion of the three predicted probabilities obtained from the three CNNs is conducted to achieve the final classification result. To reduce the complexity of our proposed CNN network, we apply two model compression techniques: model restriction and decomposed convolution. Our extensive experiments, which are conducted on DCASE 2021 (IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events)...
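Late fusion of the three per-network probability vectors reduces, in its simplest form, to a weighted average. A small NumPy sketch (equal weights assumed, since the fusion weights are not given in the excerpt):

```python
import numpy as np

def late_fusion(prob_mel, prob_gamma, prob_cqt, weights=(1/3, 1/3, 1/3)):
    """Weighted average of per-network class probabilities, each of shape
    (n_clips, n_classes), followed by an argmax over classes."""
    fused = (weights[0] * prob_mel + weights[1] * prob_gamma
             + weights[2] * prob_cqt)
    return fused.argmax(axis=1)               # final predicted scene per clip
```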