Two-level fusion-based acoustic scene classification
Growing demands from applications like surveillance, archiving, and context-aware devices have fuelled research towards efficient extraction of useful information from environmental sounds. Acoustic scene classification (ASC) deals with assigning a textual label to an audio segment based on the general characteristics of the location or situation in which it was recorded. Because audio scenes differ widely in nature, a single feature-classifier pair may not discriminate efficiently among environments, and the set of acoustic scenes may itself vary with the problem under investigation. Moreover, for many ASC applications, a general estimate of the type of surroundings (e.g., indoor or outdoor) may suffice rather than an explicit scene label (such as home or park). In this paper, we propose a two-level hierarchical framework for ASC in which finer labels follow a coarse classification. At the first level, texture features extracted from the time-frequency representation of the audio samples generate the coarse labels. The system then explores combinations of six well-known spectral features, successfully used in different audio-processing fields, for the second-level classification that provides the finer details of the audio scene. The performance of the proposed system is compared with baseline methods on the Detection and Classification of Acoustic Scenes and Events (DCASE) 2016 and 2017 ASC databases and found to be superior in terms of classification accuracy. Additionally, the proposed hierarchical method yields the coarse labels as meaningful intermediate results, which may be useful in their own right in certain applications.
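To make the two-level pipeline concrete, the following is a minimal sketch in Python. It assumes scikit-learn SVM classifiers and uses placeholder feature extractors, scene labels, and class hierarchy (texture_features, spectral_features, FINE) that stand in for the paper's actual texture and spectral pipelines; none of these names come from the paper itself.

```python
# Sketch of a two-level hierarchical ASC scheme: coarse label from texture
# features, then a per-coarse-class classifier refines it on spectral features.
# Feature extractors, labels, and classifier choice are illustrative assumptions.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

def texture_features(audio):
    """Texture features from a time-frequency image (placeholder)."""
    return rng.normal(size=32)

def spectral_features(audio):
    """Fused spectral features, e.g. combined spectral descriptors (placeholder)."""
    return rng.normal(size=64)

# Hypothetical coarse classes and the fine scene labels nested under each one.
FINE = {"indoor": ["home", "library"], "outdoor": ["park", "city_center"]}

# Synthetic training set: (audio, coarse label, fine label) triples.
train = [(None, c, f) for c, fines in FINE.items() for f in fines for _ in range(20)]

X_tex = np.stack([texture_features(a) for a, _, _ in train])
X_spec = np.stack([spectral_features(a) for a, _, _ in train])
y_coarse = np.array([c for _, c, _ in train])
y_fine = np.array([f for _, _, f in train])

# Level 1: coarse classifier trained on texture features.
coarse_clf = SVC().fit(X_tex, y_coarse)

# Level 2: one fine-grained classifier per coarse class, on spectral features.
fine_clfs = {}
for c in FINE:
    mask = y_coarse == c
    fine_clfs[c] = SVC().fit(X_spec[mask], y_fine[mask])

def classify(audio):
    """Predict the coarse label first; its dedicated classifier then refines it."""
    coarse = coarse_clf.predict(texture_features(audio)[None, :])[0]
    fine = fine_clfs[coarse].predict(spectral_features(audio)[None, :])[0]
    return coarse, fine

print(classify(None))  # e.g. ('indoor', 'home'); dummy features, so arbitrary
```

The hierarchical structure also shows why the coarse labels are available "for free" as intermediate results: an application that only needs an indoor/outdoor estimate can stop after the first prediction.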