Bert Cranen | Radboud University Nijmegen

Papers by Bert Cranen

Can voice quality measures predict strain tolerance

Item does not contain fulltext.

Evaluation of Formant-Like

An aid in language teaching: The visualisation of pitch

Human-inspired modulation frequency features for noise-robust ASR

Speech Communication, 2016

This paper investigates a computational model that combines a frontend based on an auditory model with an exemplar-based sparse coding procedure for estimating the posterior probabilities of sub-word units when processing noisified speech. Envelope modulation spectrogram (EMS) features are extracted using an auditory model which decomposes the envelopes of the outputs of a bank of gammatone filters into one lowpass and multiple bandpass components. Through a systematic analysis of the configuration of the modulation filterbank, we investigate how and why different configurations affect the posterior probabilities of sub-word units by measuring the recognition accuracy on a semantics-free speech recognition task. Our main finding is that representing speech signal dynamics by means of multiple bandpass filters typically improves recognition accuracy. This effect is particularly noticeable in very noisy conditions. In addition, we find that to achieve maximum noise robustness, the bandpass filters should focus on low modulation frequencies. This reinforces our intuition that noise robustness can be increased by exploiting redundancy in those frequency channels whose integration time is long enough not to suffer from envelope modulations that are solely due to noise. The ASR system we design based on these findings behaves more similarly to human recognition of noisified digit strings than conventional ASR systems do. Thanks to the relation between the modulation filterbank and the procedures for computing dynamic acoustic features in conventional ASR systems, the findings can be used to improve the frontends of those systems.
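
As a rough illustration of the front end described above, the sketch below decomposes gammatone channel envelopes into one lowpass and several bandpass modulation components. The channel centre frequencies, modulation band edges, and filter orders are illustrative assumptions, not the configuration evaluated in the paper.

```python
# Hedged sketch of EMS-style features: gammatone channels, envelope
# extraction, and a modulation filterbank (one lowpass + bandpass bands).
import numpy as np
from scipy.signal import gammatone, butter, sosfilt, lfilter, hilbert

def ems_features(x, fs=8000,
                 center_freqs=(125, 250, 500, 1000, 2000, 3400),  # assumed
                 mod_edges=(2, 4, 8, 16)):  # assumed modulation band edges, Hz
    feats = []
    for fc in center_freqs:
        b, a = gammatone(fc, 'iir', fs=fs)        # 4th-order gammatone channel
        env = np.abs(hilbert(lfilter(b, a, x)))   # channel envelope
        # lowpass component: slow envelope variation below the first edge
        sos = butter(2, mod_edges[0], 'low', fs=fs, output='sos')
        comps = [sosfilt(sos, env)]
        # bandpass components between successive modulation band edges;
        # low modulation frequencies dominate, consistent with the finding above
        for lo, hi in zip(mod_edges[:-1], mod_edges[1:]):
            sos = butter(2, (lo, hi), 'band', fs=fs, output='sos')
            comps.append(sosfilt(sos, env))
        feats.append(np.stack(comps))             # (n_mod_bands, n_samples)
    return np.stack(feats)                        # (n_channels, n_mod_bands, n_samples)

if __name__ == "__main__":
    x = np.random.randn(8000)                     # 1 s of noise as a stand-in
    print(ems_features(x).shape)                  # (6, 4, 8000)
```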

Smoothing speech trajectories by regularization

The articulators of human speech can only move relatively slowly, which results in gradual and continuous change of acoustic speech properties. Nevertheless, this speech continuity is rarely exploited to discriminate different phones. To exploit it, this paper investigates a multiple-frame MFCC representation (which is expected to retain sufficient time-continuity information) in combination with a supervised dimensionality reduction method whose target is to find low-dimensional representations that optimally separate different phone classes. The speech continuity information is integrated into this framework by regularization terms that penalize discontinuities. Experimental results on TIMIT phonetic classification show that the use of regularizers helps to improve the separability of phone classes.
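
A minimal sketch of the idea, assuming an LDA-style objective: class separability is maximized while a regularization term penalizes large jumps between the projections of temporally adjacent frames. The exact regularizer and objective in the paper may differ.

```python
# Hedged sketch: supervised dimensionality reduction with a continuity
# regularizer added to the within-class scatter (illustrative, not the
# authors' exact method).
import numpy as np
from scipy.linalg import eigh

def regularized_projection(X, y, lam=0.1, n_dims=40):
    """X: (n_frames, n_features) stacked multi-frame MFCCs, time-ordered;
       y: (n_frames,) phone labels; returns projection matrix W."""
    mu = X.mean(0)
    Sb = np.zeros((X.shape[1], X.shape[1]))       # between-class scatter
    Sw = np.zeros_like(Sb)                        # within-class scatter
    for c in np.unique(y):
        Xc = X[y == c]
        d = (Xc.mean(0) - mu)[:, None]
        Sb += len(Xc) * (d @ d.T)
        Sw += (Xc - Xc.mean(0)).T @ (Xc - Xc.mean(0))
    # continuity penalty: scatter of first differences of adjacent frames
    # (in a real system, differences would not cross utterance boundaries)
    D = np.diff(X, axis=0)
    R = D.T @ D
    # maximize separation while penalizing within-class scatter and
    # discontinuity: generalized eigenproblem Sb w = eig (Sw + lam R) w
    vals, vecs = eigh(Sb, Sw + lam * R + 1e-6 * np.eye(X.shape[1]))
    return vecs[:, np.argsort(vals)[::-1][:n_dims]]
```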

Announcement

Folia Phoniatrica et Logopaedica, 2004

Ventura Publisher Document

A method is presented for the automatic extraction of voice source parameters from speech. An automatic inverse filtering algorithm is used to obtain an estimate of the glottal flow signal. Subsequently, an LF-model [1] is fitted to the glottal flow signal. In the current article we focus on improving the automatic fit procedure. To keep track of the performance of the fit procedure, a quantitative evaluation criterion is preferred. Since such a criterion is difficult to obtain for natural speech, we propose an evaluation method that uses synthetic speech. We also conducted qualitative tests for disturbances that are often found in natural speech, i.e. source-filter interaction.
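
The following is a minimal sketch of the first stage, assuming a plain autocorrelation-LPC inverse filter; the paper's automatic inverse filtering algorithm and the subsequent LF-model fit are not reproduced here.

```python
# Hedged sketch: LPC-based inverse filtering of a voiced frame to approximate
# the glottal flow derivative, the signal an LF model would then be fitted to.
import numpy as np
from scipy.signal import lfilter

def lpc(x, order):
    """Autocorrelation-method LPC coefficients via Levinson-Durbin."""
    r = np.correlate(x, x, 'full')[len(x) - 1:len(x) + order]
    a = np.zeros(order + 1)
    a[0], e = 1.0, r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / e                       # reflection coefficient
        a[1:i + 1] += k * a[i - 1::-1][:i] # Levinson recursion update
        e += k * acc                       # prediction error update
    return a

def glottal_flow_derivative(frame, order=12):
    """Inverse filter a voiced frame with its own LPC envelope."""
    a = lpc(frame * np.hanning(len(frame)), order)
    return lfilter(a, [1.0], frame)        # residual ~ d(glottal flow)/dt
```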

TR01: Time-continuous Sparse Imputation

An effective way to increase the noise robustness of automatic speech recognition is to label noisy speech features as either reliable or unreliable (missing) prior to decoding, and to replace the missing ones by clean speech estimates. We present a novel method to obtain such clean speech estimates. Unlike previous imputation frameworks which work on a frame-by-frame basis, our method focuses on exploiting information from a large time-context. Using a sliding window approach, denoised speech representations are constructed using a sparse representation of the reliable features in an overcomplete basis of fixed-length exemplar fragments. We demonstrate the potential of our approach with experiments on the AURORA-2 connected digit database.
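
A compact sketch of the imputation step, under the assumption of non-negative features and a multiplicative sparse-coding update; the dictionary construction, window sliding, and solver used in the actual system are not shown.

```python
# Hedged sketch: approximate the reliable entries of a noisy feature window
# as a sparse non-negative combination of clean exemplar fragments, then read
# the missing entries off the reconstruction.
import numpy as np

def sparse_impute(y, mask, A, lam=0.1, n_iter=200):
    """y: (d,) noisy feature window (flattened, non-negative, e.g. mel magnitudes);
       mask: (d,) boolean, True = reliable; A: (d, n_exemplars) clean exemplar
       dictionary (non-negative). lam is an assumed sparsity weight."""
    Ar = A[mask]                          # dictionary rows at reliable positions
    x = np.full(A.shape[1], 1e-2)         # non-negative activations
    for _ in range(n_iter):               # multiplicative NMF-style updates
        x *= (Ar.T @ y[mask]) / (Ar.T @ (Ar @ x) + lam + 1e-12)
    y_hat = A @ x                         # full reconstruction from exemplars
    out = y.copy()
    out[~mask] = y_hat[~mask]             # keep reliable values, impute missing
    return out
```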

The Impact of Spectral and Energy Mismatch on the

Within the Aurora2 experimental framework, the aim of this study is to determine what the relative contributions of spectral shape and energy features are to the mismatch observed between clean training and noisy test data. In addition to measurements on the baseline Aurora2 system, recognition performance was also evaluated after the application of time domain noise reduction (TDNR) and histogram normalisation (HN) in the cepstral domain. The results indicate that, for the Aurora2 digit recognition task, TDNR, HN, as well as a combination of the two techniques achieve higher recognition rates by reducing the mismatch in the energy part of acoustic feature space. The corresponding mismatch reduction in the spectral shape features yields hardly any gain in recognition performance.
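
For concreteness, here is one common way to implement histogram normalisation in the cepstral domain via quantile mapping; this is an illustrative variant, not necessarily the exact HN procedure used in the study.

```python
# Hedged sketch of histogram normalisation (HN): per cepstral coefficient,
# test values are mapped through their empirical quantiles onto the quantiles
# of a clean training reference distribution.
import numpy as np

def histogram_normalise(test, train_ref):
    """test, train_ref: (n_frames, n_coeffs); returns normalised test features."""
    out = np.empty_like(test)
    q = np.linspace(0.0, 1.0, 101)                 # percentile grid (assumed)
    for c in range(test.shape[1]):
        src = np.quantile(test[:, c], q)           # test quantiles
        dst = np.quantile(train_ref[:, c], q)      # clean training quantiles
        out[:, c] = np.interp(test[:, c], src, dst)
    return out
```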

Smoothing Trajectories by Regularization

Decoding speech by feature-based segmentation

Workshop ‘Where Do Features Come From? Phonological Primitives in the Brain, the Mouth and the Ear’

Comparison of inverse filtering of the flow signal and microphone signal

This study looks at two ways of extracting a glottal waveform from recorded speech. One way is to inverse filter the flow at the mouth. Another is to inverse filter the microphone signal. Theoretically, the microphone signal is considered to be the equivalent of a first order differentiation of the flow signal recorded at the lips. Recording the oral airflow is more complicated than recording a microphone signal, as it requires the use of a mask, with constant adjustments during the recording. Recording of the microphone signal is more straightforward for the experimenter and less intrusive for the subject. If the two inverse filtering procedures can be shown to produce similar glottal flow waveforms for both types of recorded speech, this would support the use of only the microphone signal for those types of glottal flow analysis where the DC component of the flow is not essential, making voice source analysis applicable in less specialised situations. In this study, we used ...
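
The stated derivative relation suggests that the AC part of the flow can be approximated by integrating the microphone signal. A leaky integrator is a simple way to do this while curbing DC drift; the leak coefficient below is an arbitrary assumption.

```python
# Small illustration: if the microphone signal approximates d(flow)/dt, leaky
# integration recovers the AC part of the flow (the DC offset stays unknown).
import numpy as np
from scipy.signal import lfilter

def leaky_integrate(mic, leak=0.99):
    """y[n] = leak * y[n-1] + mic[n]; leak < 1 keeps low-frequency drift bounded."""
    return lfilter([1.0], [1.0, -leak], mic)
```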

Locally learning heterogeneous manifolds for phonetic classification

Computer Speech & Language, 2016

Most state-of-the-art phone classifiers use the same features and decision criteria for all phones, despite the fact that different broad classes are characterized by different manners and places of articulation that result in different acoustic features. This paper uses manifold learning to address structure in the acoustic space. Previous approaches to dimensionality reduction based on manifold learning assumed that the acoustic space can be characterized by a uniform manifold structure. In this paper we relax this assumption by learning different manifold structures for broad phonetic classes. Because all known classifiers make confusions between broad classes, we designed a two-level classifier in which the top level consists of a number of partially overlapping broad classes. Since the resulting classifiers are not statistically independent, we propose a new method for fusing the classifiers. Experimental results show that our two-level classifier obtained slightly better results when broad-class specific manifolds were learned, compared to a uniform manifold. However, the accuracy is still considerably lower than what could be obtained with oracle knowledge about broad class membership. From this we infer that phones do not form compact clusters in acoustic space.
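
Schematically, the fusion step might look as follows; the additive, membership-weighted rule shown here is an illustrative assumption, not the fusion method proposed in the paper.

```python
# Hedged sketch of a two-level classifier's fusion stage: a top-level
# classifier scores partially overlapping broad classes, and phone posteriors
# from the broad-class-specific classifiers are combined under those scores.
import numpy as np

def fuse_posteriors(broad_post, phone_post_per_broad, memberships, n_phones):
    """broad_post: (n_broad,) top-level broad-class scores;
       phone_post_per_broad: list of per-broad-class phone posteriors;
       memberships: list of index arrays mapping each broad class's phones
       into the global phone inventory (classes may overlap)."""
    fused = np.zeros(n_phones)
    for w, post, idx in zip(broad_post, phone_post_per_broad, memberships):
        fused[idx] += w * post            # overlapping classes vote additively
    return fused / fused.sum()            # renormalise over the phone set
```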

Missing data imputation using compressive sensing techniques for connected digit recognition

Het fonetogram als discriminerend instrument [The phonetogram as a discriminating instrument]

This study investigates the use of phonetography in the assessment of the future professional voice, as well as several parameters that could be extracted from the phonetogram for this purpose. The data for this study were taken from phonetograms of the voices of first-year students in teacher training. The parameters examined relate to frequency, intensity, and the shape of the phonetogram. The results of this study suggest that, with a limited set of phonetogram parameters, it is possible to distinguish between a control group and a group selected on perceptual grounds by speech therapists as susceptible to voice disorders, while neither group exhibits subjective voice problems or vocal fold abnormalities. Parameters that are common to both sexes and distinguish the two groups from each other are the melodic semitone range, the maximum dynamic range, and the slope of the steep part of the upper contour. After a breakdown by sex, the lowest intensity for women and the mean speaking pitch for men appear to be parameters indicative of voice problems to be expected. Suggestions for further research are given.

Speaker variability in the coarticulation of /a,i,u/

Speech Communication, 1996

Speaker variability in the coarticulation of the vowels /a, i, u/ was investigated in /C1VC2ə/ pseudo-words containing the consonants /p, t, k, d, s, m, n, r/. These words were read out in isolation by fifteen male speakers of Dutch. The formants F1–F3 (in Bark) were extracted from the steady-state of each vowel /a, i, u/. Coarticulation in each of 1200 realisations per vowel was measured in F1–F3 as a function of consonantal context, using a score-model based measure called COART. The largest amount of coarticulation was found in /u/, where nasals and alveolars in word-initial position had the largest effect on the formant positions, especially on F2. Coarticulation in /a, u/ proved to be speaker-specific. For these vowels the speaker variability of COART in a context was generally larger if COART itself was larger. Finally, when studied in a speaker identification task, COART improved identification results only when three conditions were combined: (a) COART was used as an additional parameter to F1–F3; (b) the COART values for the vowel were high; (c) all vowel contexts were pooled in the analysis. The two main conclusions from this study are that coarticulation cannot be investigated speaker-independently and that COART can contribute to speaker identification, but only under very restricted conditions.

The phonochrome: A coherent spectro-temporal representation of sound

Hearing Research, 1981

Representation of simple stationary sounds can be given either in the temporal form, by displaying the waveform as a function of time, or in the spectral form, by intensity and phase as a function of frequency. For complex nonstationary sounds, e.g. animal vocalisations and human speech, a combined spectro-temporal representation is more directly associated with auditory perception. The well-known sonogram or dynamic power spectrum has a fixed spectro-temporal resolution and neglects phase relations of different spectral and temporal sound components. In this paper the complex spectro-temporal intensity density (CoSTID) is presented as a coherent spectro-temporal image of a sound, based on the analytic signal representation. The CoSTID allows an arbitrary form of the spectro-temporal resolution and preserves phase relations of different sound components. Since the CoSTID is a complex function of two variables, it leads naturally to the use of colour images for the spectro-temporal representation of sound: the phonochrome. Phonochromes are shown for different technical and natural sounds. Applications of this technique for the study of phonation and audition and for biomedical signal processing are indicated.
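
A rough sketch of how such a coherent complex representation can be rendered as a colour image, with phase mapped to hue and magnitude to brightness; the STFT of the analytic signal used here is only a stand-in for the CoSTID's arbitrary spectro-temporal resolution.

```python
# Hedged sketch: a complex spectro-temporal image from the analytic signal,
# rendered in colour (phase -> hue, magnitude -> brightness).
import numpy as np
from scipy.signal import hilbert, stft
import matplotlib.colors as mcolors

def phonochrome_image(x, fs):
    z = hilbert(x)                                     # analytic signal
    f, t, Z = stft(z, fs=fs, nperseg=256, return_onesided=False)
    Z = np.fft.fftshift(Z, axes=0)                     # sort frequency axis
    hue = (np.angle(Z) + np.pi) / (2 * np.pi)          # phase -> hue in [0, 1]
    val = np.abs(Z) / (np.abs(Z).max() + 1e-12)        # magnitude -> brightness
    hsv = np.stack([hue, np.ones_like(hue), val], -1)
    return mcolors.hsv_to_rgb(hsv)                     # (freq, time, 3) RGB image
```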

Modeling a leaky glottis

Journal of Phonetics, 1995

Acoustic Pre-Processing For Optimal Effectivity Of Missing Feature Theory

In this paper we investigate acoustic backing-off as an operationalization of Missing Feature Theory, with the aim of increasing recognition robustness. Acoustic backing-off effectively diminishes the detrimental influence of outlier values by using a new model of the probability density function of the feature values. The technique avoids the need for explicit outlier detection. Situations that are handled best by Missing Feature Theory are those where only part of the coefficients are disturbed and the rest of the vector is unaffected. Consequently, one may predict that acoustic feature representations that smear local spectro-temporal distortions over all feature vector elements are inherently less suitable for automatic speech recognition. Our experiments seem to confirm this prediction. Using additive band-limited noise as a distortion and comparing four different types of feature representations, we found that the best recognition performance is obtained with recognizers that use ...
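
One way to realize such a robust density, sketched below under the assumption of a mixture of the trained Gaussian with a flat distribution over the feature range: a single outlying coefficient can then lower the log-likelihood only by a bounded amount.

```python
# Hedged sketch of a backed-off observation density. The mixture weight eps
# and the flat range (lo, hi) are illustrative assumptions, not the paper's
# exact parameterization.
import numpy as np

def backed_off_loglik(x, mu, sigma, lo, hi, eps=0.01):
    """Robust log-likelihood of feature vector x for a diagonal-Gaussian state."""
    gauss = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    flat = 1.0 / (hi - lo)                 # uniform pdf over the feature range
    # the flat floor bounds the penalty any single outlier coefficient can incur
    return np.sum(np.log((1 - eps) * gauss + eps * flat))
```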
