Martin Hagmüller | Graz University of Technology
Papers by Martin Hagmüller
Language Resources and Evaluation, May 1, 2016
We introduce a unique, comprehensive Austrian German multi-sensor corpus with moving and non-moving speakers to facilitate the evaluation of estimators and detectors that jointly detect a speaker's spatial and temporal parameters. The corpus is suitable for various machine learning and signal processing tasks, linguistic studies, and studies related to a speaker's fundamental frequency (due to recorded glottograms). Available corpora are limited to (synthetically generated/spatialized) speech data or recordings of musical instruments that lack moving speakers, glottograms, and/or multi-channel distant speech recordings. That is why we recorded 24 spatially non-moving and moving speakers, balanced male and female, to set up a two-room and 43-channel Austrian German multi-sensor speech corpus. It contains 8.2 hours of read speech based on phonetically balanced sentences, commands, and digits. The orthographic transcriptions include around 53,000 word tokens and 2,070 word types. Special features of this corpus are the laryngograph recordings (representing glottograms required to detect a speaker's instantaneous fundamental frequency and pitch), corresponding clean-speech recordings, and spatial information and video data provided by four Kinects and a camera.
Speech Communication, Nov 1, 2017
This paper presents the first large-scale analysis of pronunciation variation in conversational Austrian German. Whereas conversational speech in the varieties of German spoken in Germany has received considerable attention in the fields of linguistics and automatic speech recognition, for conversational Austrian German there is a lack of speech resources and tools as well as linguistic and phonetic studies. Based on the recently collected GRASS corpus, we provide (methods for the creation of) a pronunciation dictionary and (tools for the creation of) broad phonetic transcriptions for Austrian German. Subsequently, we present a comparative analysis of the occurrence of phonological and reduction rules in read and conversational speech. We find that whereas some rules are specific to the Austrian Standard variety and thus occur in both speech styles (e.g., the realization of /z/ as [s]), other rules are specific to conversational speech (e.g., the realization of /a:/ as [o:]). Overall, our results show that fewer words are produced in citation form in conversational Austrian German (37.8%) than in other languages in the same style (e.g., Dutch conversations: 56%).
Journal of Voice, Mar 1, 2017
Objectives. Diplophonia is an often misinterpreted symptom of disordered voice and needs objectification. An audio signal processing algorithm for the detection of diplophonia is proposed. Diplophonia is produced by two distinct oscillators, which yields a profound physiological interpretation. The algorithm's performance is compared with the clinical standard parameter degree of subharmonics (DSH). Study Design. This is a prospective study. Methods. A total of 50 dysphonic subjects (28 with diplophonia and 22 without diplophonia) and 30 subjects with euphonia were included in the study. From each subject, up to five sustained phonations were recorded during rigid telescopic high-speed video laryngoscopy. A total of 185 phonations were split up into 285 analysis segments of homogeneous voice qualities. In accordance with the clinical group allocation, the considered segmental voice qualities were (1) diplophonic, (2) dysphonic without diplophonia, and (3) euphonic. The Diplophonia Diagram is a scatter plot that relates the one-oscillator synthesis quality (SQ1) to the two-oscillator synthesis quality (SQ2). Multinomial logistic regression is used to distinguish between diplophonic and nondiplophonic segments. Results. Diplophonic segments can be well distinguished from nondiplophonic segments in the Diplophonia Diagram because two-oscillator synthesis is more appropriate for imitating diplophonic signals than one-oscillator synthesis. The detection of diplophonia using the Diplophonia Diagram clearly outperforms the DSH in terms of positive likelihood ratio (56.8 versus 3.6). Conclusions. The diagnostic accuracy of the newly proposed method for detecting diplophonia is superior to the DSH approach, which should be taken into account in future clinical and scientific work.
Electro-larynx (EL) speech is a way to regain speech when the larynx has been surgically removed or damaged. As currently available devices are normally hand-held, a new generation of EL devices would benefit from a hands-free version. In this work we use electromyographic (EMG) signals to investigate speech/non-speech detection for EL speech. The muscle activity represented by the EMG signal correlates with the intention to produce speech sounds, so its short-term energy can serve as a feature for a speech/non-speech decision. We developed data acquisition hardware to record EMG signals using surface electrodes. We then recorded a small database of parallel EMG and EL speech recordings and used different approaches to classify the EMG signal into speech and non-speech sections. We compared the following envelope calculation methods: root mean square, Hilbert envelope, and low-pass filtered envelope; and different classification methods: single threshold, double threshold, and Gaussian mixture model based classification. This study suggests that the results are speaker dependent, i.e., they strongly depend on the signal-to-noise ratio of the EMG signal. We show that the low-pass filtered envelope combined with double-threshold detection outperforms the other approaches.
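The envelope-plus-threshold pipeline described above can be sketched in a few lines; the window length, the two thresholds, and the toy EMG burst below are illustrative assumptions, not values from the study.

```python
import math

def rms_envelope(signal, win=4):
    """Short-term RMS envelope over a sliding window (illustrative window size)."""
    env = []
    for i in range(len(signal)):
        frame = signal[max(0, i - win + 1): i + 1]
        env.append(math.sqrt(sum(x * x for x in frame) / len(frame)))
    return env

def double_threshold(env, high=0.5, low=0.2):
    """Hysteresis decision: enter 'speech' above `high`, leave below `low`."""
    speech, flags = False, []
    for e in env:
        if not speech and e > high:
            speech = True
        elif speech and e < low:
            speech = False
        flags.append(speech)
    return flags

emg = [0.0] * 5 + [1.0] * 10 + [0.0] * 5   # toy EMG burst
flags = double_threshold(rms_envelope(emg))
```

The double threshold adds hysteresis, so brief dips in muscle activity do not toggle the decision the way a single threshold would.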
In computerized lung sound research, the use of a pneumotachograph, which defines the respiratory phase and airflow velocity, is essential. To obviate the need for it, the influence of airflow rate on the characteristics of lung sounds is of great interest. We therefore investigate its effect on the amplitude and regional distribution of normal lung sounds. We record lung sounds on the posterior chest of four lung-healthy male subjects in supine position with a 16-channel lung sound recording device at different airflow rates. We use acoustic thoracic images to discuss the influence of airflow rate on the regional distribution. At each airflow rate, we observe louder lung sounds over the left hemithorax, and a constant regional distribution above an airflow rate of 0.7 l/s. Furthermore, we observe a linear relationship between the airflow rate and the amplitude of lung sounds.
Biomedical Signal Processing and Control, Aug 1, 2017
Schenk F, Aichinger P, Roesner I, Urschler M. Automatic high-speed video glottis segmentation using salient regions and 3D geodesic active contours.
Determination of pitch marks (PMs) is necessary in clinical voice assessment for the measurement of fundamental frequency (F0) and perturbation. In voice with ambiguous F0, PM determination is crucial, and its validity needs special attention. The study at hand proposes a new approach for PM determination from Laryngeal High-Speed Videos (LHSVs), rather than from the audio signal. In this novel approach, double PMs are extracted from a diplophonic voice sample, in order to account for ambiguous F0s. The LHSVs are spectrally analyzed in order to extract dominant oscillation frequencies of the vocal folds. Unit pulse trains with these frequencies are created as PM trains and compensated for the phase shift. The PMs are compared to Praat's single audio PMs. It is shown that double PMs are needed in order to analyze diplophonic voice, because traditional single PMs do not explain its double-source characteristic.
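A minimal sketch of the pulse-train construction, assuming a toy sampling rate, two invented dominant frequencies, and an arbitrary phase offset:

```python
def pulse_train(f0, fs, n_samples, phase=0.0):
    """Unit pulse train at frequency f0 (Hz): a 1 at each cycle onset, else 0.
    `phase` shifts the first pulse (in samples)."""
    period = fs / f0
    train = [0] * n_samples
    t = phase
    while t < n_samples:
        idx = int(round(t))
        if idx < n_samples:
            train[idx] = 1
        t += period
    return train

fs = 1000                                  # toy sampling rate (Hz)
pm1 = pulse_train(100, fs, 100)            # first oscillator
pm2 = pulse_train(80, fs, 100, phase=3.0)  # second oscillator, phase-compensated
```

Each train places a unit pulse at every cycle onset of its oscillator; comparing two such trains against single audio pitch marks makes the double-source structure of diplophonic voice explicit.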
Folia Phoniatrica Et Logopaedica, 2016
Objectives: The aims of this study are to investigate the effects of diplophonia on jitter and shimmer and to identify measurement limitations with regard to material selection and clinical interpretation. Materials and Methods: Four hundred and ninety-eight audio samples of sustained phonations were analyzed. The audio samples were assessed for the grade of hoarseness and the presence of diplophonia. Jitter and shimmer were reported with regard to the perceptual ratings. We examined cycle marker positions in qualitative examples to understand their implications for perturbation measurements. Results: Medians of jitter and shimmer were higher for diplophonic voices than for nondiplophonic voices with equal grades of hoarseness. The variance of jitter for moderately dysphonic voices was larger than the variance observed in a corpus from which diplophonic samples had been discarded. The positions of cycle markers in diplophonic voices did not match the positions of the pulses, indicating that the validity of jitter and shimmer values for these voices is questionable. Conclusion: Diplophonia biases the reporting of dysphonia severity via perturbation measures, and their validity is questionable for these voices. In addition, diplophonia is an influential source of variance in jitter measurements. Thus, diplophonic fragments of voice samples should be excluded prior to perturbation analysis.
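Local jitter and shimmer are both instances of the same relative cycle-to-cycle perturbation measure: the mean absolute difference between consecutive cycles divided by the mean value. A sketch with invented cycle lengths and amplitudes:

```python
def local_perturbation(values):
    """Mean absolute difference of consecutive cycles, relative to the mean.
    Applied to cycle lengths this gives local jitter; applied to cycle
    peak amplitudes, local shimmer."""
    diffs = [abs(a - b) for a, b in zip(values, values[1:])]
    return (sum(diffs) / len(diffs)) / (sum(values) / len(values))

periods = [10.0, 10.2, 9.9, 10.1]   # toy cycle lengths (ms)
amps = [1.00, 0.95, 1.05, 1.00]     # toy cycle peak amplitudes
jitter = local_perturbation(periods)
shimmer = local_perturbation(amps)
```

Since both measures rest entirely on the cycle markers, misplaced markers in diplophonic voice corrupt the numerator directly, which is what makes the resulting values questionable.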
This paper presents a robust multichannel lung sound recording device (LSRD) for automatic lung sound classification. Compared to common approaches, we improved the usability and the robustness against body sounds and ambient noise. We developed a novel lung sound transducer (LST) and an appropriate attachment method realized as a foam pad. For analogue prefiltering, preamplification, and digitization of the lung sound signal, we use a combination of low-cost standard audio recording equipment. Furthermore, we developed suitable recording software. In our experiments, we show the robustness of our LSRD against ambient noise and demonstrate the achieved signal quality. The LST's microphone features a signal-to-noise ratio of SNR = 80 dB. We thereby obtain a bandwidth of up to f ≈ 2500 Hz for vesicular lung sound recordings. Compared to attaching the LST with self-adhesive tape, the foam pad achieves an attenuation of ambient noise of up to 50 dB in the relevant frequency range. The result of this work is a multichannel recording device that enables fast gathering of valuable lung sounds in noisy clinical environments without impeding daily routines.
For automatic speech recognition (ASR) systems it is important that the input signal mainly contains the desired speech signal. For a compact arrangement, differential microphone arrays (DMAs) are a suitable choice as the front-end of ASR systems. The limiting factor of DMAs is the white noise gain, which can be treated by the minimum norm solution (MNS). In this paper, we introduce the MNS to adaptive differential microphone arrays for the first time. We compare its effect to the conventional implementation when used as the front-end of an ASR system. In experiments we show that the proposed algorithms consistently increase word accuracy by up to 50% relative to their conventional implementations. For PESQ we achieve an improvement of up to 0.1 points.
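The directivity of a first-order DMA can be sketched as two closely spaced omnidirectional microphones whose delayed outputs are subtracted; the spacing, frequency, and cardioid delay below are illustrative, and this sketch covers only the fixed (non-adaptive) beamformer, not the MNS.

```python
import cmath
import math

def dma_response(theta_deg, f=1000.0, d=0.01, c=343.0):
    """First-order delay-and-subtract DMA with the rear microphone delayed
    by the acoustic travel time d/c (cardioid). Returns |H| for a plane
    wave arriving from angle theta (0 deg = front of the array axis)."""
    w = 2 * math.pi * f
    tau = d / c                         # cardioid delay
    theta = math.radians(theta_deg)
    return abs(1 - cmath.exp(-1j * w * (d * math.cos(theta) / c + tau)))

front = dma_response(0.0)
back = dma_response(180.0)
```

Choosing the delay equal to the acoustic travel time between the microphones places the null at 180°, i.e., a cardioid pattern; the small spacing is also why white noise gain becomes the limiting factor at low frequencies.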
The aim of this work is the estimation of respiratory flow from lung sound recordings, i.e., acoustic airflow estimation. With a 16-channel lung sound recording device, we simultaneously record the respiratory flow and the lung sounds on the posterior chest of six lung-healthy subjects in supine position. For the recordings of four selected sensor positions, we extract linear frequency cepstral coefficient (LFCC) features and map these onto the airflow signal. We use multivariate polynomial regression to fit the features to the airflow signal. In contrast to most previous approaches, the proposed method uses lung sounds instead of trachea sounds. Furthermore, our method estimates the airflow without prior knowledge of the respiratory phase, i.e., no additional algorithm for phase detection is required. Another benefit is the avoidance of time-consuming calibration. In experiments, we evaluate the proposed method for various selections of sensor positions in terms of the mean squared error (MSE) between estimated and actual airflow. Moreover, we show the accuracy of the method for frame-based breathing-phase detection.
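The regression step can be illustrated in a deliberately simplified univariate, first-order form (the paper uses multivariate polynomial regression on LFCC features); the feature/airflow pairs below are invented:

```python
def linear_fit(xs, ys):
    """Ordinary least squares for y ~ a*x + b (closed form)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

# toy "feature value -> airflow" pairs, invented for illustration
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.1, 1.1, 2.1, 3.1]
a, b = linear_fit(xs, ys)
```

The multivariate polynomial case works the same way in principle: stack polynomial terms of all features into a design matrix and solve the least-squares problem for the coefficient vector.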
Speech Communication, Dec 1, 2006
This paper presents a security-enhanced speaker verification system based on speech signal watermarking. Our proposed system can detect several situations in which playback speech, synthetically generated speech, a manipulated speech signal, or an impostor imitating a speaker's voice attempts to fool the biometric system. In addition, we have generated a database of watermarked speech signals from which we have drawn relevant conclusions about the influence of this technique on speaker verification rates. Most importantly, we have verified that biometrics and watermarking can coexist with minimal mutual effects. Experimental results show that the proposed speech watermarking system can withstand A-law coding with a message error rate lower than 2×10⁻⁴ for SWR higher than 20 dB at a message rate of 48 bits/s.
Centrality measures derived from character networks can be used to detect the main characters in a play. For example, previous research has shown that characters with high network centrality typically perform the majority of speech acts and appear in most of the scenes (Fischer, Trilcke, Kittel, Milling, & Skorinkin, 2018). However, one can extract character networks from plays in various ways: close reading may omit minor characters such as attendants or servants (e.g., Moretti, 2011), while distant reading (e.g., parsing an XML file) may include aggregate characters like "All", "Both Lords", or similar. Furthermore, the networks may display either implicit or
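Degree centrality on a toy co-occurrence network illustrates the idea without any graph library; the characters and edges below are invented for illustration:

```python
from collections import defaultdict

def degree_centrality(edges):
    """Normalized degree centrality: neighbours / (n - 1)."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    n = len(adj)
    return {c: len(nb) / (n - 1) for c, nb in adj.items()}

# Invented co-occurrence edges for a toy four-character play
edges = [("Hamlet", "Horatio"), ("Hamlet", "Ophelia"),
         ("Hamlet", "Claudius"), ("Ophelia", "Claudius")]
cent = degree_centrality(edges)
main = max(cent, key=cent.get)   # the character with the highest centrality
```

How the edge list is obtained (close vs. distant reading, with or without aggregate characters) directly changes the resulting centrality ranking, which is exactly the methodological concern raised above.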