Spectrogram Research Papers - Academia.edu

The acoustical bird monitoring system is a new tool under development that will provide automatic support for bird species recognition. The project is interdisciplinary research involving specialists in ecology, biology, databases, electronics, and electroacoustics, as well as experts from nature protection institutions. One of the crucial aspects of the project is the recording of bird voices. The paper presents the recording methods, the compulsory and optional information accompanying the recordings, exemplary species chosen for recording, and the tools applied for data analysis. In 2008 and 2009, seventy-six scientific expeditions dedicated to recording bird species were undertaken. The collected and acoustically analysed material comprised about 152 hours of recordings; vocalizations of 49 bird species were recorded and analysed.

Hearing a species in a tropical rainforest is much easier than seeing it. Someone in the forest might not be able to look around and see every type of bird and frog that is there, but they can be heard. A forest ranger might know what to do in these situations and might be expert at recognizing the different types of insects and dangerous species out there in the forest, but a lay visitor travelling to a rainforest for an adventure might not even know how to recognize these species, let alone take suitable action against them. In this work, a model is built that takes an audio signal as input, performs intelligent signal processing to extract features and patterns, and outputs which species is present in the audio signal. The model works end to end on raw input, and a pipeline is also created to perform all the preprocessing steps on the raw input. Different neural network architectures based on Long Short-Term Memory (LSTM) and Convolutional Neural Network (CNN) models are tested. Both show reliable performance: the CNN achieves an accuracy of 95.62% and a log loss of 0.21, while the LSTM achieves an accuracy of 93.12% and a log loss of 0.17. Based on these results, the CNN performs better in terms of accuracy while the LSTM performs better in terms of log loss. The two models are then combined to achieve high accuracy and low log loss: the combination achieves an accuracy of 97.12% and a log loss of 0.16.
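
A minimal sketch of this kind of pipeline, from a fixed-length clip to a log-mel spectrogram to a small CNN classifier, assuming librosa and PyTorch; the file name, clip length, mel resolution, class count, and architecture are placeholders, not the models evaluated in the paper.

```python
# A minimal sketch, assuming librosa and PyTorch; file name, clip length,
# mel resolution, and class count are placeholders.
import librosa
import numpy as np
import torch
import torch.nn as nn

def clip_to_melspec(path, sr=22050, duration=5.0, n_mels=64):
    """Load a fixed-length clip and convert it to a log-mel spectrogram."""
    y, _ = librosa.load(path, sr=sr, duration=duration)
    y = np.pad(y, (0, max(0, int(sr * duration) - len(y))))  # pad short clips
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max)

class SpeciesCNN(nn.Module):
    """A small CNN over the (1, n_mels, time) spectrogram image."""
    def __init__(self, n_classes=24):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(32, n_classes)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

spec = clip_to_melspec("rainforest_clip.wav")              # hypothetical file
x = torch.tensor(spec).float().unsqueeze(0).unsqueeze(0)   # (batch, 1, mel, t)
logits = SpeciesCNN()(x)                                   # one score per species
```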

In this paper, we present an approach based on convolutional neural networks to build an automatic speech recognition system for the Amazigh language. The system is built with TensorFlow and uses mel frequency cepstral coefficients (MFCC) to extract features. In order to test the effect of the speaker's gender and age on the accuracy of the model, the system was trained and tested on several datasets. In the first experiment, the dataset consists of 9240 audio files. In the second experiment, the dataset consists of 9240 audio files distributed between female and male speakers. In the third experiment, the dataset consists of 13860 audio files distributed among the age groups 9-15, 16-30, and 30+. The results show that the model trained on the dataset of adult speakers (age 30+) achieves the best accuracy, at 93.9%.
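
A minimal sketch of the MFCC feature-extraction step, shown here with librosa rather than the authors' TensorFlow pipeline; the file name, sampling rate, and coefficient count are assumptions.

```python
import librosa

y, sr = librosa.load("amazigh_word.wav", sr=16000)   # hypothetical file
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # shape (13, n_frames)
print(mfcc.shape)  # one 13-dimensional MFCC vector per analysis frame
```

Each column summarizes the short-term spectral envelope of one frame; stacking these vectors forms the input that a CNN of this kind classifies.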

This paper discusses music analysis in relation to images of musical sound produced using spectral, or Fourier, analysis. Fourier analysis allows us to make pictures of musical performances that reveal the fundamental, overtones, and complex noise-like sounds. These images provide an acoustic basis for our musical perceptions. The recent history of Fourier analysis in relation to musical analysis is discussed first, and then three analyses are presented using works from a variety of periods and cultures: a prelude by Chopin, a Japanese gagaku work, and a computer piece by Markus Popp. Chopin is discussed in relation to comparative performances, Japanese gagaku is explored in relation to timbre, and Markus Popp's piece is examined in relation to non-notated electronic music. Each work highlights a specific approach that can be taken using spectral or Fourier analysis.

Most prior research on audio recognition has been carried out on speech and music; only in recent years has work on Environmental Sound Recognition emerged and gained importance. For audio classification, many previous efforts utilize acoustic features such as Mel-frequency Cepstral Coefficients (MFCCs), Zero Crossing Rate (ZCR), Root Mean Square Energy (RMSE), spectral centroid, spectral bandwidth, and other frequency-domain features derived from the spectrogram of the audio. In this paper, we use a slightly different feature-extraction approach: we summarize short audio clips of about five seconds by segmenting out the most prominent part of the audio signal. We then compute the spectrogram image of the segmented audio and divide it into sub-bands along the frequency axis. For each sub-band, we extract first-order statistics and Gray Level Co-occurrence Matrix (GLCM) features. In the classification stage, we combine two Support Vector Machine (SVM) classifiers: the first uses the first-order statistics and GLCM features, while the second uses acoustic features such as MFCCs, ZCR, RMSE, spectral centroid, spectral bandwidth, and other frequency-domain features derived from the spectrogram, to obtain the final result. We evaluate our approach on two publicly available datasets, ESC-10 and Freiburg-106, with five-fold and ten-fold cross-validation respectively. Experiments show that the proposed approach outperforms the baselines and provides results comparable to the state of the art.
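
A minimal sketch of the sub-band texture features described above, assuming scipy for the spectrogram and scikit-image for the GLCM; the number of bands, the quantization level, and the GLCM offsets are assumptions, not the paper's settings.

```python
import numpy as np
from scipy import signal
from skimage.feature import graycomatrix, graycoprops

def subband_glcm_features(audio, sr, n_bands=4, levels=32):
    """First-order statistics plus GLCM texture features per frequency band."""
    _, _, sxx = signal.spectrogram(audio, fs=sr)
    img = 10 * np.log10(sxx + 1e-10)                 # dB spectrogram image
    img = ((img - img.min()) / (np.ptp(img) + 1e-10)
           * (levels - 1)).astype(np.uint8)          # quantize gray levels
    feats = []
    for band in np.array_split(img, n_bands, axis=0):  # split along frequency
        feats += [band.mean(), band.std()]             # first-order statistics
        glcm = graycomatrix(band, distances=[1], angles=[0],
                            levels=levels, symmetric=True, normed=True)
        for prop in ("contrast", "homogeneity", "energy", "correlation"):
            feats.append(graycoprops(glcm, prop)[0, 0])
    return np.array(feats)                             # input to the first SVM
```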

Beyond the choice of notes, which is prescribed by a shared tradition, what distinguishes one jazz improviser from another? What constitutes their "sound", their "signature"? Since more traditional methods of analysis are insufficient to capture the full expression contained in jazz improvisation, we must turn to the study of recordings of improvised solos in order to examine the musical "performance", that is, the elements related to the instrumentalist's production of sound. However, because the ear can overlook certain elements in favour of others, we adapted a spectrogram-based model of analysis originally used for popular-music singing. By adding the visual to the auditory, this model allows a more exhaustive inventory of the "sound gestures" contained in the improvisations, that is, all sonic manifestations produced by instrumental gestures performed within a musical process. Drawing on a research-creation project, an autoethnographic method allows us to identify sound gestures in my own improvisations and to compare them with those of the internationally renowned saxophonist Chris Potter. The analyses identify five subtle but distinctive sound gestures, observed in the microvariations of pitch during the attack or decay of certain notes, in a particular use of multiphonics, and in vibrato, dynamic play, and key noise.

Power quality disturbances have noteworthy ramifications for electricity consumers: they can affect manufacturing processes, cause equipment malfunction, and induce economic losses. Thus, an automated system is required to identify and classify these signals for diagnosis purposes. This paper presents the development of a power quality disturbance detection and classification system using the spectrogram, a linear time-frequency distribution (TFD) technique. The TFD represents the signals as a time-frequency representation (TFR), which is convenient for analyzing power quality disturbances. Signal parameters such as the instantaneous RMS voltage, RMS fundamental voltage, total waveform distortion (TWD), total harmonic distortion (THD), and total non-harmonic distortion (TnHD) are estimated from the TFR to identify the characteristics of the signals. These signal characteristics then serve as input to a classifier that identifies the type of power quality disturbance.
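
A minimal sketch of two of the signal parameters listed above, RMS voltage and total harmonic distortion, estimated here from a plain FFT rather than from the authors' spectrogram-based TFR; the 50 Hz fundamental and the number of harmonics are assumptions.

```python
import numpy as np

def rms_and_thd(v, fs, f0=50.0, n_harmonics=10):
    """RMS of a voltage waveform and THD relative to the fundamental."""
    rms = np.sqrt(np.mean(v ** 2))
    spectrum = np.abs(np.fft.rfft(v)) / len(v)
    freqs = np.fft.rfftfreq(len(v), d=1.0 / fs)
    def mag(f):  # magnitude of the bin nearest frequency f
        return spectrum[np.argmin(np.abs(freqs - f))]
    fund = mag(f0)
    harmonics = np.array([mag(k * f0) for k in range(2, n_harmonics + 1)])
    return rms, np.sqrt(np.sum(harmonics ** 2)) / fund

# 50 Hz fundamental with a 5% third harmonic as a synthetic test signal
fs = 5000
t = np.arange(0, 0.2, 1 / fs)
v = np.sin(2 * np.pi * 50 * t) + 0.05 * np.sin(2 * np.pi * 150 * t)
print(rms_and_thd(v, fs))  # RMS near 0.71, THD near 0.05
```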

Convolution reverb is the process of reverberating a signal with the impulse response of an actual space. The product of this operation is a third signal containing the reverberation of the space in which the impulse response was captured. This project is based on creating a convolution reverb plugin in MaxMSP and evaluating the perceptual quality of convolution reverb.
The real-time fast convolution reverb plugin was implemented in MaxMSP using a multiple frequency delay line, non-uniformly partitioned convolution method. An experimental methodology was designed to evaluate the perceptual quality of convolution reverb: anechoic drum kit samples were recorded and then re-recorded in a chamber to form the control signal, and, using the same equipment and setup, an impulse response was captured in the same chamber to form the test sample. These samples were used for subjective analysis in a pair-wise categorical preference listening test, and their spectrograms were compared to analyse the quality objectively.
The listening tests covered realism, quality, and personal preference categories. The null hypothesis held for realism, where only 52% of the participants preferred the chamber reverb (control signal). In the personal preference category, however, 61% of the subjects preferred the chamber reverb. This was supported by the objective analysis, where the convolution reverb was shown to have a faster decay rate in the high-frequency bands, and thus to sound unnatural.
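
A minimal offline sketch of convolution reverb as a single FFT-based convolution; the plugin described above instead uses non-uniformly partitioned convolution to keep real-time latency low. File names are hypothetical and mono signals are assumed.

```python
import numpy as np
import soundfile as sf
from scipy.signal import fftconvolve

dry, fs = sf.read("anechoic_drums.wav")              # hypothetical mono recording
ir, fs_ir = sf.read("chamber_impulse_response.wav")  # hypothetical mono IR
assert fs == fs_ir, "signal and impulse response must share a sample rate"

wet = fftconvolve(dry, ir)                  # the reverberated third signal
wet /= np.max(np.abs(wet)) + 1e-12          # normalize to avoid clipping
sf.write("convolved_drums.wav", wet, fs)
```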

"Los bosques de cemento" ("The Concrete Forests") is a musical composition for symphony orchestra, about six minutes long, composed at the end of 2013. The work won the First "Emilio Lehmberg" Composition Competition for Symphony Orchestra, organized by the Conservatorio Superior de Música de Málaga. It was premiered on 7 February 2014 in the Sala Falla of the Conservatorio Superior de Música de Málaga by the conservatory's symphony orchestra, made up of students of the centre, under the baton of the orchestral conducting professor David García Carmona. A video recording of the premiere is included below.

The presence of K-pop production in Norway is not coincidental. Norwegian producers have identified economic and artistic market opportunities in Korea, while the Korean music industry has encouraged non-domestic producers to export pop songs to the Korean market and to collaborate with Korean producers. This model of production is part of corporate globalization strategies such as S.M. Entertainment's "cultural technology." Yet literature on the music in K-pop and its production is scarce. This thesis explores the musical content and production practices in transnationally produced K-pop tracks, and further investigates how international and transnational collaborations work in the production of K-pop. It attempts to bridge the gap between the most common themes of Hallyu research (culture, fandom, cultural geography, economics, ethnography, or an interdisciplinary amalgam of these) and K-pop's musical content.

Pilar Bayona, one of the foremost Spanish pianists of the twentieth century, is the subject of this study, as her figure has hitherto not been examined from a technical and interpretive perspective. Her interpretations of Spanish music were particularly interesting. The aim of this work is to analyse some of the recorded fragments of this musical style that she left. To undertake this task I use my own working methodology, which allows objective assessments of her individual performance by means of sound spectrograms, pictures, and other graphs derived from them.

Laptop composition – the creation and performance of music primarily using laptop computers – emerged as an important musical activity in the last decade of the twentieth century. While much has been written about the cultural and conceptual significance of this new music, less has been published regarding the sonic structure of specific works. This article explores the musical structure and design of compositions by three laptop composers at the turn of the millennium: 'Untitled #2' by Oval (Markus Popp), 'Cow Cow' by Merzbow (Masami Akita), and 'Powerbookfiend' by Kid606 (Miguel De Pedro). Each piece is analysed using spectrographic images, representations of musical sound that allow for the precise measurement of frequency and intensity. Repetition and noise are revealed as musical characteristics common to all three pieces, defining both smaller-scale patterns and large-scale designs. Using the conceptual vocabulary of Paul Virilio and Gilles Deleuze, repetition and noise are framed in relation to a 'machine aesthetic' and 'difference and repetition'.

With the turn of the millennium computer music has become ubiquitous. Advances in computer technology allow composers not only to distribute their music more widely (by creating sound files for instant dissemination or posting on the internet) but also to create and shape sound itself into seemingly infinite forms. From this plethora of sonic bits and bytes a new generation of composers has emerged. These composers perform in concert and create recorded tracks using the laptop computer as their primary instrument.

This unique study investigates the frequency spectrum of certain Rosicrucian vowel chants using PRAAT software (Boersma 2001, 341-345) for analysis. Certain Rosicrucian vowel sound intonations, which are shared by other traditions, were pre-recorded in a recording studio and analyzed. Given that many physiological, emotional, and psychic effects are experienced by both the intoner and the listener when these vowel sounds are produced, the frequency characteristics of such sounds may lead to correlations that reveal the cause of their health-producing effects. Studies of intensity vs. time and of spectral characteristics related to the pitch and formants of the sound were also considered. It is anticipated that the results will provide a deeper understanding of the relationships between frequencies and their possible "resonant" effects on tissues.

This paper reveals concordances between pitch space and vowel sounds in Hildegard von Bingen’s O rubor sanguinis. The antiphon not only displays a cogent pitch and register structure, but also projects a vivid timbre design created through the vowels of the text. I argue that vowel sounds correspond to, and help corroborate, the antiphon’s progression through pitch space—thus both vowel sounds and pitch space work in tandem to express the imagery and meaning of the text. I begin by discussing medieval language and music connections as revealed through current research as well as in medieval theoretical treatises. Next, I discuss the concept of tetrachordal pitch space as understood during the twelfth century, highlighting specific links between vowel sounds and these tetrachordal regions in the antiphon. Two performances of the antiphon are then compared using computer images, or spectrographs, which display the physical sonic profile of the combined vowel and pitch sounds. These images demonstrate how specific language-sound transformations seen in the text are also reflected in the musical timbre of the performances. Throughout my analysis I explore the domain of semantics, identifying ways in which the music’s pitch space and language sounds help express the meaning of the words.

This paper presents a novel approach to categorizing dolphin whistles into various types. The most accurate methods of identifying dolphin whistles are tedious and not robust, especially in the presence of ocean noise. One of the biggest challenges of dolphin whistle extraction is the coexistence of short-duration wide-band echo clicks with the whistles. In this research, a subspace of selected orientation parameters of the 2-D Gabor wavelet frames is utilized to enhance or suppress signals by their orientation. The result is a Gabor image containing a noise-free grayscale representation of the fundamental dolphin whistle, which is resampled and fed into a Sparse Representation Classifier. The classifier uses the l1-norm to select a match. Experimental studies demonstrate: (a) a robust technique based on Gabor wavelet filters for extracting reliable call patterns, and (b) the superior performance of the Sparse Representation Classifier for identifying dolphin whistles by their call type.
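
A minimal sketch of orientation-selective 2-D Gabor filtering of a spectrogram image, in the spirit of the whistle/click separation described above; the paper's Gabor wavelet frame subspace and sparse classifier are not reproduced here, and the filter frequency and orientations are assumptions.

```python
import numpy as np
from skimage.filters import gabor

def orientation_energy(spec_img, frequency=0.1, thetas=(0.0, np.pi / 2)):
    """Magnitude response of a grayscale spectrogram at chosen orientations.

    In spectrogram coordinates, slowly modulated whistles appear as
    near-horizontal ridges while broadband echo clicks appear as vertical
    lines, so keeping only selected orientations suppresses the clicks.
    """
    responses = {}
    for theta in thetas:
        real, imag = gabor(spec_img, frequency=frequency, theta=theta)
        responses[theta] = np.hypot(real, imag)
    return responses
```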

Most current supervised automatic music transcription (AMT) models lack the ability to generalize: they have trouble transcribing real-world music recordings from diverse musical genres that are not present in the labelled training data. In this paper, we propose a semi-supervised framework, ReconVAT, which addresses this issue by leveraging the huge amount of available unlabelled music recordings. ReconVAT uses a reconstruction loss and virtual adversarial training. When combined with existing U-net models for AMT, ReconVAT achieves competitive results on common benchmark datasets such as MAPS and MusicNet. For example, in the few-shot setting for the string-part version of MusicNet, ReconVAT achieves F1-scores of 61.0% and 41.6% for the note-wise and note-with-offset-wise metrics respectively, which translates into improvements of 22.2% and 62.5% over the supervised baseline model. The proposed framework also demonstrates the potential of continual learning on new data, which could be useful in real-world applications where new data is constantly available.
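
A minimal sketch of virtual adversarial training (VAT), one of the two ingredients of ReconVAT; the perturbation sizes xi and eps and the single power-iteration step are standard VAT choices, not values taken from the paper.

```python
import torch
import torch.nn.functional as F

def vat_loss(model, x, xi=1e-6, eps=1.0):
    """Consistency loss between predictions on x and on x plus adversarial noise."""
    def unit(v):  # normalize each sample's perturbation to unit norm
        n = v.flatten(1).norm(dim=1).view(-1, *([1] * (v.dim() - 1)))
        return v / (n + 1e-12)

    with torch.no_grad():                      # target distribution is constant
        logp = F.log_softmax(model(x), dim=-1)

    d = xi * unit(torch.randn_like(x))         # small random probe direction
    d.requires_grad_(True)
    dist = F.kl_div(F.log_softmax(model(x + d), dim=-1),
                    logp.exp(), reduction="batchmean")
    grad = torch.autograd.grad(dist, d)[0]     # one power-iteration step
    r_adv = eps * unit(grad).detach()          # virtual adversarial perturbation

    adv_logp = F.log_softmax(model(x + r_adv), dim=-1)
    return F.kl_div(adv_logp, logp.exp(), reduction="batchmean")
```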

The periodic inspection of railroad tracks is very important for finding the structural and geometrical problems that lead to railway accidents. Currently, in Pakistan, rail tracks are inspected by an acoustic-based manual system that requires a railway engineer, as a domain expert, to differentiate between different rail track faults, which is cumbersome, laborious, and error-prone. This study proposes combining the traditional acoustic-based system with deep learning models to increase performance and reduce train accidents. Two convolutional neural network (CNN) models, a 1D and a 2D CNN, and one recurrent neural network (RNN) model, a long short-term memory (LSTM) network, are used in this regard. Initially, three track conditions are considered: superelevation, wheel burnt, and normal. Contrary to traditional acoustic-based systems, where the spectrogram dataset is generated before model training, the proposed approach uses on-the-fly feature extraction.
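
A minimal sketch of the on-the-fly idea the abstract contrasts with pre-generated spectrogram datasets: here the spectrogram is computed inside the dataset's __getitem__ during training. The file list, labels, and library choices (soundfile, scipy, PyTorch) are assumptions.

```python
import soundfile as sf
import torch
from scipy import signal
from torch.utils.data import Dataset

class RailTrackSounds(Dataset):
    """Acoustic rail-track clips with spectrograms computed on the fly."""
    LABELS = {"normal": 0, "superelevation": 1, "wheel_burnt": 2}

    def __init__(self, items):
        self.items = items              # list of (wav_path, label_name) pairs

    def __len__(self):
        return len(self.items)

    def __getitem__(self, i):
        path, label = self.items[i]
        audio, fs = sf.read(path)                      # hypothetical mono clip
        _, _, sxx = signal.spectrogram(audio, fs=fs)   # computed on the fly
        x = torch.log1p(torch.tensor(sxx, dtype=torch.float32)).unsqueeze(0)
        return x, self.LABELS[label]
```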

While analyses of vocal music often highlight connections between textual meaning and musical structure, less common is the study of language sounds in relation to musical structure. This presentation explores Guillaume de Machaut’s virelai Tuit mi penser, highlighting connections that exist between language sounds, pitch space and meaning. Vowel and consonant content is analyzed in the text, and computer images are used to study the language sounds of an actual performance. One of my main conclusions is that Tuit mi penser not only displays a cogent pitch space structure, but also projects a vivid sound design created through the language sounds of the text. The language sounds, pitches, and registers often reinforce one another, working in tandem to convey the imagery and meaning of the text.

It has been well documented that humpback whales produce songs with a specific structure [Payne]. The NIPS4B challenge provides 26 minutes of a remarkable humpback whale song recording, made a few meters from the whale off La Reunion in the Indian Ocean by the "Darewin" research group in 2013, at a sampling frequency of 44.1 kHz, 32 bits, mono, wav format (Fig. 1). Usually, Mel Filter Cepstrum Coefficients are used as parameters to describe these songs [Pace et al.]. We propose here another efficient representation, the scalogram, and we demonstrate that the sea noise is efficiently removed, even in the case of lower-SNR recordings, allowing robust song representations.
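
A minimal sketch of computing a scalogram with PyWavelets' continuous wavelet transform; the wavelet, scale range, and file name are assumptions rather than the authors' settings.

```python
import numpy as np
import pywt
import soundfile as sf

audio, fs = sf.read("humpback_excerpt.wav")   # hypothetical short mono excerpt
scales = np.geomspace(8, 512, num=96)         # coarse-to-fine scale range
coeffs, freqs = pywt.cwt(audio, scales, "morl", sampling_period=1.0 / fs)
scalogram = np.abs(coeffs)                    # |CWT| as a time-scale image
```

Denoising then amounts to thresholding or masking this time-scale image before reconstructing or measuring the song units.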

Decreasing the road accident rate and increasing road safety have long been major concerns, as traffic accidents expose drivers, passengers, and property to danger. Driver fatigue and drowsiness are among the most critical factors affecting road safety, especially on highways. The EEG signal is one of the reliable physiological signals used to detect the driver's fatigue state, but having to wear a multi-channel headset to acquire it limits the adoption of EEG-based systems among drivers. The current work proposes a driver fatigue detection system using transfer learning that depends on only one EEG channel, to increase the system's usability. The system first acquires the signal and passes it through preprocessing filters, then converts it to a 2D spectrogram. Finally, the 2D spectrogram is classified with AlexNet, using transfer learning, as either a normal or a fatigue state. The study compares the accuracy of seven EEG channels to select the most accurate channel to rely on for classification. The results show that channels FP1 and T3 are the most effective at indicating the driver's fatigue state, achieving accuracies of 90% and 91% respectively. Therefore, using only one of these channels with the modified AlexNet CNN model can yield an efficient driver fatigue detection system.
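
A minimal sketch of the pipeline described above, assuming scipy and torchvision: a single-channel EEG segment is converted to a 2D spectrogram and passed to a pretrained AlexNet whose final layer is replaced for the two-class normal/fatigue decision. The sampling rate, segment length, and resizing are assumptions.

```python
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from scipy import signal
from torchvision.models import alexnet

model = alexnet(weights="DEFAULT")           # ImageNet-pretrained backbone
model.classifier[6] = nn.Linear(4096, 2)     # two classes: normal vs fatigue

def eeg_to_input(segment, fs=256):
    """Single-channel EEG segment -> 3-channel spectrogram image for AlexNet."""
    _, _, sxx = signal.spectrogram(segment, fs=fs)
    img = torch.log1p(torch.tensor(sxx, dtype=torch.float32))
    img = img.unsqueeze(0).unsqueeze(0)                    # (1, 1, f, t)
    img = F.interpolate(img, size=(224, 224), mode="bilinear")
    return img.repeat(1, 3, 1, 1)                          # AlexNet expects RGB

x = eeg_to_input(np.random.randn(10 * 256))  # dummy 10 s single-channel segment
logits = model(x)                            # fine-tune on labelled segments
```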

Facial and head actions contain significant affective information. To date, these actions have mostly been studied in isolation because the space of naturalistic combinations is vast. Interactive visualization tools could enable new explorations of dynamically changing combinations of actions as people interact with natural stimuli. This paper describes a new open-source tool that enables navigation of and interaction with dynamic face and gesture data across large groups of people, making it easy to see when multiple ...

This paper proposes an active contour algorithm for spectrogram track detection. It extends previously published work in a number of areas: previously published internal and potential energy models are refined, and theoretical motivations for these changes are offered. These refinements yield a marked improvement in detection performance, including a notable reduction in the probability of false positive detections. The result is feature extraction at signal-to-noise ratios as low as −1 dB in the frequency domain.
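
For orientation, the classic active contour (snake) formulation that such algorithms build on minimizes an energy of the form below; the refined internal and potential energy models the paper actually proposes differ from this textbook version.

```latex
E[\mathbf{v}] \;=\; \int_0^1 \tfrac{1}{2}\left(\alpha\,\lvert \mathbf{v}'(s)\rvert^2 + \beta\,\lvert \mathbf{v}''(s)\rvert^2\right) ds
\;+\; \int_0^1 E_{\mathrm{pot}}\!\left(\mathbf{v}(s)\right) ds
```

Here \mathbf{v}(s) parameterizes a candidate track through the time-frequency plane; the first integral is the internal energy (tension weighted by \alpha, stiffness by \beta), and the potential term couples the contour to the spectrogram intensities.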

A Byzantine Music piece performed by a well-recognized chanter is used to derive experimentally the mean frequencies of the first five tones (D – A) of the diatonic scale of Byzantine Music. The experimentally derived frequencies are then compared with the frequencies proposed by two theoretical scales, both representative of traditional Byzantine Music chanting. We found that a scale performed by a traditional chanter is very close in frequency to the theoretically proposed frequencies, except for tone F. An allowed frequency deviation from the mean frequency of each tone is determined; the concept of an allowed deviation is not provided by theory. Comparing our results to the notion of pitch discrimination from psychophysics, it is further established that the frequency differences are minute. The Attraction Effect is tested for a secondary tone (E) and the effect is quantified for the first time. The concept of the Attraction Effect has not been explained in theory in t...

This article explores the timbre of Thai classical singing, with an emphasis on uan, a wordless vocalization based upon specific vowels and consonants, using spectrographic images. Spectrographs are computer images that visualize the fundamental, overtones, and noise-like sounds of a performance. Uan can be reduced to five basic sounds, which are analyzed individually and then placed within a musical context. Performances from different generations and musical lineages are contrasted and compared in relation to timbre. I argue that it is the relative timbral consistency of the five basic uan sounds, connecting musical generations and lineages, combined with the unique characteristics of each performance, that defines the distinct sound of Thai classical singing.