Identification of one- and two-formant steady-state vowels: A model and experiments

Formant-frequency discrimination for isolated English vowels

The Journal of the Acoustical Society of America, 1994

Thresholds for formant-frequency discrimination were obtained for ten synthetic English vowels patterned after a female talker. To estimate the resolution of the auditory system for these stimuli, thresholds were measured using well-trained subjects under minimal-stimulus-uncertainty procedures. Thresholds were estimated for both increments and decrements in formant frequency for the first and second formants. Reliable measurements of threshold were obtained for most formants tested, the exceptions occurring when a harmonic of the fundamental was aligned with the center frequency of the test formant. In these cases, unusually high thresholds were obtained from some subjects and asymmetrical thresholds were measured for increments versus decrements in formant frequency. Excluding those cases, thresholds for formant frequency, ΔF, are best described as a piecewise-linear function of frequency which is constant at about 14 Hz in the F1 frequency region (< 800 Hz), and increases linearly in the F2 region. In the F2 region, the resolution for formant frequency is approximately 1.5%. The present thresholds are similar to previous estimates in the F1 region, but about a factor of three lower than those in the F2 region. Comparisons of these results to those for pure tones and for complex, nonspeech stimuli are discussed.

PACS numbers: 43.71.Es, 43.66.Fe

INTRODUCTION

Understanding human listeners' abilities to process the sounds of speech requires a solid empirical data base, starting with measures of the identification of sounds as particular phonemes. In addition, however, it is also important to obtain reliable estimates of the resolving power, or discrimination capabilities, of the auditory system for these same stimuli.

Studies with both nonspeech and speech sounds (Watson et al., 1976; Watson, 1987; Kewley-Port et al., 1988) have demonstrated very large effects of the duration of training, psychophysical methods, and the level of trial-to-trial stimulus uncertainty, on discrimination thresholds. The purpose of the research presented here was not to repeat earlier studies demonstrating the results of these various procedural variables, but rather to obtain discrimination data for an important category of speech stimuli, namely vowels. Procedures previously demonstrated to be optimal for obtaining discrimination thresholds were employed, including minimal stimulus uncertainty and the use of well-trained subjects. The specific properties of vowels that were studied were changes in formant frequencies of F1 and F2. There have been several previous studies of this particular ability, but most of them were run with a small set of vowels, a small number of subjects, or procedures that are likely nonoptimal. Given the fundamental importance of vowel discrimination in speech processing, a replication and systematic extension of this earlier work seemed worthwhile. This report is the first in a series which examines vowel discrimination in a variety of phonetic contexts and at different levels of stimulus uncertainty. We have begun this series of studies using isolated synthetic, steady-state vowels. It is generally agreed that the formant-frequency values of F1 and probably F2 are the most important properties of vowel spectra which contribute to phonetic identification.

Thus many studies, beginning with Flanagan's in 1955, have estimated the just-noticeable change in frequency (ΔF) for a single formant frequency, a change that is commonly reported as a Weber ratio, ΔF/F (Flanagan, 1955; Mermelstein, 1978; Nord and Sventelius, 1979; Nakagawa et al., 1982; Hermansky, 1987; Sinnott and Kreiter, 1991). Studies of the discrimination of changes in other properties of vowel spectra have included formant bandwidth (Flanagan, 1957), both formant and bandwidth (Ghitza and Goldstein, 1983; Gagne and Zurek, 1988), and more global aspects of vowel spectra (Klatt, 1982; Bladon and Lindblom, 1980).

a) Some material in this manuscript was presented at the 119th Meeting of the Acoustical Society of America [J. Acoust. Soc. Am. Suppl. 1 87, S159 (1990)].
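The piecewise-linear description of ΔF in the abstract above (about 14 Hz in the F1 region, about 1.5% of formant frequency in the F2 region) can be sketched as a small function. This is only an illustration: the function name and the use of `max()` to join the two segments are assumptions, not the authors' fitted function, and with these round numbers the two segments meet near 933 Hz rather than at the 800 Hz boundary quoted in the abstract.

```python
def formant_threshold_hz(formant_hz, f1_floor_hz=14.0, f2_weber=0.015):
    """Rough ΔF estimate in Hz for a formant at formant_hz.

    Assumption: the constant F1-region floor and the F2-region Weber
    fraction are joined with max(), which keeps the curve continuous.
    """
    return max(f1_floor_hz, f2_weber * formant_hz)


if __name__ == "__main__":
    for f in (300.0, 700.0, 1500.0, 2500.0):
        dF = formant_threshold_hz(f)
        # Print the threshold and the corresponding Weber ratio ΔF/F.
        print(f"F = {f:6.0f} Hz  ΔF ≈ {dF:5.1f} Hz  ΔF/F ≈ {dF / f:.3f}")
```

Expressing the result as ΔF/F makes the contrast visible: the ratio falls with frequency in the F1 region and settles at the constant Weber fraction in the F2 region.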

Auditory models of formant frequency discrimination for isolated vowels

The Journal of the Acoustical Society of America, 1998

Thresholds for formant discrimination of female and male vowels are significantly elevated by two stimulus factors, increases in formant frequency and fundamental frequency [Kewley-Port et al., J. Acoust. Soc. Am. 100, 2462-2470 (1996)]. The present analysis systematically examined whether auditory models of vowel sounds, including excitation patterns, specific loudness, and a Gammatone filterbank, could explain the effects of stimulus parameters on formant thresholds. The goal was to determine if an auditory metric could be specified that reduced variability observed in the thresholds to a single-valued function across four sets of female and male vowels. Based on Sommers and Kewley-Port [J. Acoust. Soc. Am. 99, 3770-3781 (1996)], four critical bands around the test formant were selected to calculate a metric derived from excitation patterns. A metric derived from specific loudness difference (ΔSone) was calculated across the entire frequency region. Since analyses of spectra from Gammatone filters gave similar results to those derived from excitation patterns, only the 4-ERB (equivalent rectangular bandwidth) and ΔSone metrics were analyzed in detail. Three criteria were applied to the two auditory metrics to determine if they were single-valued functions relative to formant thresholds for female and male vowels. Both the 4-ERB and ΔSone metrics met the criteria of reduced slope, reduced effect of fundamental frequency, although ΔSone was superior to 4-ERB in reducing overall variability. Results suggest that the auditory system has an inherent nonlinear transformation in which differences in vowel discrimination thresholds are almost constant in the internal representation.
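To make the "four critical bands around the test formant" concrete, the sketch below computes the edges of a 4-ERB-wide region centered on a formant. It assumes the widely used Glasberg and Moore ERB-number scale and a window centered on the formant in ERB units. The abstract does not state which ERB formula or window placement was used, so both are assumptions, and the function names are hypothetical.

```python
import math


def erb_number(f_hz):
    """ERB-number (Cams) for a frequency in Hz (Glasberg and Moore scale, assumed)."""
    return 21.4 * math.log10(4.37 * f_hz / 1000.0 + 1.0)


def erb_number_to_hz(e):
    """Inverse of erb_number: frequency in Hz for a given ERB-number."""
    return (10.0 ** (e / 21.4) - 1.0) * 1000.0 / 4.37


def four_erb_band(formant_hz, half_width_erb=2.0):
    """Lower and upper edge (Hz) of a band spanning 2 ERBs on each side of the formant."""
    center = erb_number(formant_hz)
    return (erb_number_to_hz(center - half_width_erb),
            erb_number_to_hz(center + half_width_erb))


if __name__ == "__main__":
    for formant in (500.0, 2000.0):
        lo, hi = four_erb_band(formant)
        print(f"formant {formant:.0f} Hz: band from {lo:.0f} Hz to {hi:.0f} Hz")
```

Because the ERB scale is roughly logarithmic at high frequencies, the same 4-ERB window covers a much wider range in Hz around an F2 than around an F1.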

Systematic errors in the formant analysis of steady-state vowels

Speech communication, 2002

The locations of formants in a speech signal are usually estimated by computing the linear predictive coefficients (LPC) over a sliding window and finding the peaks in the spectrum of the resulting LP filter. The peak locations are estimated either by root-solving or by computing a coarse spectrum and finding its maxima. We discuss four sources of systematic error in this analysis: (1) quantization of the speech signal due to the fundamental frequency, (2) incorrect order for the LP filter, (3) exclusive reliance upon root-solving, and (4) the three-point parabolic interpolation used to compensate for the coarse spectrum. We show that the expected error due to F0 quantization is ~10% of F0, and that the other three sources can independently skew the final formant estimates by 10-80 Hz. We also show that errors due to incorrect filter order are related to systematic differences between speakers and phonetic classes, and that root-solving is especially error-prone for low formants or when formants are close to each other. We discuss methods for avoiding these errors and improving the accuracy of formant estimation, and give a heuristic for estimating the optimal filter order of a steady-state signal.
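The LPC-plus-root-solving pipeline described in this abstract can be sketched in a few lines of NumPy. This is a generic textbook version (autocorrelation method, roots of the prediction polynomial), not the authors' implementation. The function names, the optional Hamming window, and the bandwidth formula are assumptions made for illustration. Note that the abstract itself warns against relying on root-solving alone, especially for low or closely spaced formants.

```python
import numpy as np


def lpc_autocorr(x, order, window=True):
    """LP polynomial [1, a1, ..., ap] from the autocorrelation normal equations."""
    x = np.asarray(x, dtype=float)
    if window:
        x = x * np.hamming(len(x))
    n = len(x)
    # Autocorrelation at lags 0..order (zero lag sits at index n - 1).
    r = np.correlate(x, x, mode="full")[n - 1:n + order]
    # Toeplitz system R p = r[1:], where p are the predictor coefficients.
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    p = np.linalg.solve(R, r[1:order + 1])
    return np.concatenate(([1.0], -p))


def formants_from_roots(a, fs):
    """Candidate formant frequencies and bandwidths (Hz) from the LP roots."""
    roots = np.roots(a)
    roots = roots[np.imag(roots) > 0]            # one root per conjugate pair
    freqs = np.angle(roots) * fs / (2.0 * np.pi)
    bws = -np.log(np.abs(roots)) * fs / np.pi    # pole radius to bandwidth
    idx = np.argsort(freqs)
    return freqs[idx], bws[idx]
```

For a typical analysis frame one would call `lpc_autocorr(frame, order)` followed by `formants_from_roots(a, fs)`. The choice of `order` is exactly the parameter the abstract identifies as a source of speaker- and vowel-dependent error, so it should not be treated as a fixed constant.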

Vowel formant discrimination in ordinary listening conditions. I

The Journal of the Acoustical Society of America, 1996

Thresholds for formant frequency discrimination have been established previously using optimal listening conditions. In normal conversation, the ability to discriminate formant frequency is probably substantially degraded. The present study examined formant discrimination under more ordinary listening conditions characterized as tasks under higher levels of stimulus uncertainty with vowels embedded in syllables, phrases and sentences, and with the addition of a sentence identification task. Four vowels synthesized from a female talker were presented in isolation, or in the phonetic context of /bVd/ syllables, three-word phrases or nine-word sentences. In the first experiment, phonetic context was manipulated in a novel experimental protocol to estimate formant discrimination. Undesirable training effects were observed and led to the design of a new protocol to reduce this problem for the second experiment. This formant discrimination experiment manipulated both length of phonetic context and level of difficulty of a competing sentence identification task. Similar discrimination results were obtained in both experiments. The effect of longer phonetic context on formant discrimination is compressive such that no difference in formant resolution was found for vowels embedded in the phrase or sentence contexts. The addition of a challenging sentence identification task to the discrimination task did not further degrade performance and a stable pattern for formant discrimination emerged. This norm for the resolution of vowel formants under ordinary listening conditions was shown to be a constant of 0.28 Barks. Analysis of the American English vowel space determined that the closest vowels, on average, were 0.56 Barks apart, that is, a factor of two larger than the perceptual constraint of vowel formant discrimination under ordinary listening conditions.

a) Portions of these data were presented at the 132nd meeting of the Acoustical Society of America [J.

Formant discrimination in noise for isolated vowels

The Journal of the Acoustical Society of America, 2004

Formant discrimination for isolated vowels presented in noise was investigated for normal-hearing listeners. Discrimination thresholds for F1 and F2, for the seven American English vowels /{, (, }, ,, #,~, É/, were measured under two types of noise, long-term speech-shaped noise (LTSS) and multitalker babble, and also under quiet listening conditions. Signal-to-noise ratios (SNR) varied from −4 to +4 dB in steps of 2 dB. All three factors, formant frequency, signal-to-noise ratio, and noise type, had significant effects on vowel formant discrimination. Significant interactions among the three factors showed that threshold-frequency functions depended on SNR and noise type. The thresholds at the lowest levels of SNR were highly elevated by a factor of about 3 compared to those in quiet. The masking functions (threshold vs SNR) were well described by a negative exponential over F1 and F2 for both LTSS and babble noise. Speech-shaped noise was a slightly more effective masker than multitalker babble, presumably reflecting small benefits (1.5 dB) due to the temporal variation of the babble.
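One way to picture a negative-exponential masking function is as an elevation factor relative to the threshold in quiet that decays toward 1 as SNR improves. The sketch below is purely illustrative. The only value taken from the abstract is the roughly threefold elevation at the lowest SNR (−4 dB). The functional parameterization, the decay constant `decay_db`, and the function name are hypothetical and are not fitted values from the study.

```python
import math


def threshold_elevation(snr_db, worst_snr_db=-4.0, worst_factor=3.0, decay_db=4.0):
    """Ratio of the threshold in noise to the threshold in quiet.

    Assumed form: 1 + (worst_factor - 1) * exp(-(snr - worst_snr) / decay_db).
    decay_db is a made-up decay constant chosen only for illustration.
    """
    excess = (worst_factor - 1.0) * math.exp(-(snr_db - worst_snr_db) / decay_db)
    return 1.0 + excess


if __name__ == "__main__":
    for snr in (-4, -2, 0, 2, 4):
        print(f"SNR {snr:+d} dB: threshold ≈ {threshold_elevation(snr):.2f} x quiet")
```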

Modeling formant frequency discrimination of female vowels

The Journal of the Acoustical Society of America, 1996

The present investigations were designed to establish the features of vowel spectra that mediate formant frequency discrimination. Thresholds for detecting frequency shifts in the first and second formants of two steady-state vowels were initially measured for conditions in which the amplitudes of all harmonics varied in accordance with a model of cascade formant synthesis. In this model, changes in formant frequency produce level variations in components adjacent to the altered formant as well as in harmonics spectrally remote from the shifted resonant frequency. Discrimination thresholds determined with the cascade synthesis procedure were then compared to difference limens (DLs) obtained when the number of harmonics exhibiting level changes was limited to the frequency region surrounding the altered formant. Results indicated that amplitude variations could be restricted to one to three components near the shifted formant before significant increases in formant frequency DLs were observed. In a second experiment, harmonics remote from the shifted formant were removed from the stimuli. In most cases, thresholds for these reduced-harmonic complexes were not significantly different from those obtained with full-spectrum vowels. Preliminary evaluation of an excitation-pattern model of formant frequency discrimination indicated that such a model can provide good accounts of the thresholds obtained in the present experiments once the salient regions of the vowel spectra have been identified. Implications of these findings for understanding the mechanism mediating vowel perception are discussed.

Experiments in Speech Perception : Phonetics Research Seminar 1978-1979

1979

An experiment was designed to investigate whether the perception of vowel quality is affected by the interaction that occurs at the acoustic level between a formant peak and the strongest harmonics within that formant. Synthetic one-formant stimuli were used in an identification task involving [a], [ɔ], [o] as responses. Three formant frequencies were experimentally established for these vowels and F0 values were selected for each vowel so as to produce harmonic configurations with the strongest harmonic deviating both maximally and minimally from the formant frequency. The results support the claim that the phonetic quality of a given formant pattern is a monotonic function of F0 and is not influenced by a "strongest harmonic" effect.

Why does [a] change to [�] when F0 is increased?: Interplay between harmonic structure and formant frequency in the perception of vowel quality.

2.3 Analysis and prediction of difference limen data for formant frequencies.
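The stimulus design described above depends on how far the strongest harmonic sits from the formant peak for a given F0. A minimal sketch of that relationship follows. It assumes, as a simplification, that the strongest harmonic in a one-formant stimulus is the one closest in frequency to the formant. The function name and the example values are hypothetical and are not taken from the report.

```python
def nearest_harmonic(f0_hz, formant_hz):
    """Return (harmonic number, harmonic frequency, deviation from the formant) in Hz.

    Simplifying assumption: the harmonic nearest the formant peak is the strongest.
    """
    n = max(1, round(formant_hz / f0_hz))
    harmonic_hz = n * f0_hz
    return n, harmonic_hz, harmonic_hz - formant_hz


if __name__ == "__main__":
    formant = 520.0  # hypothetical formant frequency
    for f0 in (100.0, 130.0, 160.0):
        n, h, dev = nearest_harmonic(f0, formant)
        print(f"F0 {f0:.0f} Hz: harmonic {n} at {h:.0f} Hz, deviation {dev:+.0f} Hz")
```

Sweeping F0 for a fixed formant in this way shows which F0 values place a harmonic on the formant (minimal deviation) and which leave the formant midway between two harmonics (maximal deviation).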

Vowel formant discrimination: Towards more ordinary listening conditions

The Journal of the Acoustical Society of America, 1999

Thresholds for formant frequency discrimination have been established using optimal listening conditions. In normal conversation, the ability to discriminate formant frequency is probably substantially degraded. The purpose of the present study was to change the listening procedures in several substantial ways from optimal towards more ordinary listening conditions, including a higher level of stimulus uncertainty, increased levels of phonetic context, and with the addition of a sentence identification task. Four vowels synthesized from a female talker were presented in isolation, or in the phonetic context of /bVd/ syllables, three-word phrases, or nine-word sentences. In the first experiment, formant resolution was estimated under medium stimulus uncertainty for three levels of phonetic context. Some undesirable training effects were obtained and led to the design of a new protocol for the second experiment to reduce this problem and to manipulate both length of phonetic context and level of difficulty in the simultaneous sentence identification task. Similar results were obtained in both experiments. The effect of phonetic context on formant discrimination is reduced as context lengthens such that no difference was found between vowels embedded in the phrase or sentence contexts. The addition of a challenging sentence identification task to the discrimination task did not degrade performance further and a stable pattern for formant discrimination in sentences emerged. This norm for the resolution of vowel formants under these more ordinary listening conditions was shown to be nearly a constant at 0.28 barks. Analysis of vowel spaces from 16 American English talkers determined that the closest vowels, on average, were 0.56 barks apart, that is, a factor of 2 larger than the norm obtained in these vowel formant discrimination tasks.
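The 0.28-bark norm and the 0.56-bark average spacing between the closest vowels can be related to Hz with a Hz-to-bark conversion. The sketch below uses Traunmüller's (1990) approximation. The abstract does not say which bark formula the authors used, so that choice is an assumption, as are the function names.

```python
def hz_to_bark(f_hz):
    """Hz to bark using Traunmüller's approximation (assumed, not from the abstract)."""
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53


def bark_to_hz(z):
    """Algebraic inverse of hz_to_bark."""
    return 1960.0 * (z + 0.53) / (26.28 - z)


def bark_step_in_hz(f_hz, step_bark=0.28):
    """Upward change in Hz that corresponds to step_bark barks at f_hz."""
    return bark_to_hz(hz_to_bark(f_hz) + step_bark) - f_hz


if __name__ == "__main__":
    for f in (500.0, 1500.0, 2500.0):
        norm = bark_step_in_hz(f, 0.28)      # discrimination norm from the abstract
        spacing = bark_step_in_hz(f, 0.56)   # average closest-vowel spacing
        print(f"{f:.0f} Hz: 0.28 bark ≈ {norm:.0f} Hz, 0.56 bark ≈ {spacing:.0f} Hz")
```

A constant step in barks corresponds to a growing step in Hz as formant frequency rises, which is why a single bark-scaled norm can summarize thresholds across F1 and F2.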