How stable are acoustic metrics of contrastive speech rhythm
Related papers
Rhythm measures and dimensions of durational variation in speech
Journal of The Acoustical Society of America, 2011
Patterns of durational variation were examined by applying 15 previously published rhythm measures to a large corpus of speech from five languages. In order to achieve consistent segmentation across all languages, an automatic speech recognition system was developed to divide the waveforms into consonantal and vocalic regions. The resulting duration measurements rest strictly on acoustic criteria. Machine classification showed that rhythm measures could separate languages at rates above chance. Within-language variability in rhythm measures, however, was large and comparable to that between languages. Therefore, different languages could not be identified reliably from single paragraphs. In experiments separating pairs of languages, a rhythm measure that was relatively successful at separating one pair often performed very poorly on another pair: there was no broadly successful rhythm measure. Separation of all five languages at once required a combination of three rhythm measures. ...
The usefulness of metrics in the quantification of speech rhythm
Journal of Phonetics, 2012
The performance of the rhythm metrics ΔC, %V, PVIs and Varcos, said to quantify rhythm class distinctions, was tested using English, German, Greek, Italian, Korean and Spanish. Eight participants per language produced speech using three elicitation methods: spontaneous speech, story reading, and reading a set of sentences. The sentence set was divided into "uncontrolled" sentences from original works of each language and sentences devised to maximize or minimize syllable structure complexity ("stress-timed" and "syllable-timed" sets, respectively). Rhythm classifications based on pooled data were inconsistent across metrics, while cross-linguistic differences in scores were often statistically non-significant even for comparisons between prototypical languages like English and Spanish. Metrics showed substantial inter-speaker variation and proved very sensitive to elicitation method and syllable complexity, so that the size of both effects was large and often comparable to that of language. These results suggest that any cross-linguistic differences captured by metrics are not robust; metric scores range substantially within a language and are readily affected by a variety of methodological decisions, making cross-linguistic comparisons and rhythmic classifications based on metrics unsafe at best.
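For readers unfamiliar with the metrics named in this abstract, the following is a minimal sketch of how they are commonly defined in the rhythm-metrics literature. It is not code from any of the listed papers. The function names, the use of the population standard deviation, and the assumption that durations arrive as plain lists of seconds are illustrative choices.

```python
from statistics import mean, pstdev


def percent_v(vocalic, consonantal):
    """%V: share of total duration taken up by vocalic intervals."""
    total = sum(vocalic) + sum(consonantal)
    return 100.0 * sum(vocalic) / total


def delta(durations):
    """Delta (e.g. ΔC, ΔV): standard deviation of interval durations.
    Assumption: population standard deviation; some studies use the sample form."""
    return pstdev(durations)


def varco(durations):
    """Varco: Delta normalised by the mean duration, times 100."""
    return 100.0 * pstdev(durations) / mean(durations)


def rpvi(durations):
    """Raw PVI: mean absolute difference between successive intervals."""
    diffs = [abs(a - b) for a, b in zip(durations, durations[1:])]
    return mean(diffs)


def npvi(durations):
    """Normalised PVI: each successive difference is divided by the
    mean of the pair before averaging, then multiplied by 100."""
    terms = [abs(a - b) / ((a + b) / 2.0) for a, b in zip(durations, durations[1:])]
    return 100.0 * mean(terms)
```

Because Delta and rPVI are expressed in raw time units, they scale with speech rate, whereas Varco and nPVI divide by a local or global mean. This may be one reason different metrics can disagree on the same data.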
APPLYING DURATIONAL METRICS TO RECORDED SPEECH VS. TTS – EVIDENCE FROM 6 LANGUAGES
The present study is motivated by the observation that TTS samples of concatenative synthesis may often sound a-rhythmic to the human ear. For this reason, we aimed to compare samples of speech from real speakers (studio recordings) with synthetic samples. We followed a fairly complicated procedure in order to obtain comparable samples, including several steps. First, we selected 2000 recorded sentences each of English, French, German, Spanish, Italian and Japanese from larger speech corpora (one speaker per language). Then, we further divided them into (a) 1500 sentences to be included in the speech-base of the TTS system, and (b) 500 sentences for testing. The former 1500 sentences were used to create TTS voices (for each of the six languages), with which we synthesized the latter 500 sentences. We thereby obtained 500 sentences in two flavours, i.e. as recorded samples and as TTS samples. These were all automatically segmented starting from the text, which was itself phonetically transcribed by our in-house software. The result of the transcription and segmentation was then converted to C and V intervals by an ad-hoc script and imported to R for analysis. Pearson's correlation coefficient was calculated for each segment of recorded speech vs. TTS samples. Results show that correlation is high overall (ranging from 0.82 to 0.93 for the six speaker-TTS pairs) and prove that concatenative synthesis is able to reproduce global durational characteristics of speech. Values for the most popular rhythm metrics (deltas, %V, PVIs, CCIs) were also computed and plotted to illustrate durational variability for the samples. Results for recorded samples of the six languages studied here reflect previous results reported in the literature: English and German samples tend to show greater durational variability than French, Spanish, Italian and Japanese, and they tend to have lower vocalic percentage. Results for TTS samples also show the very same trend.
In fact, the mean values for each TTS system tend to sit very close to those of its recorded speech counterpart. It is also remarkable that no general trend was found to connect TTS samples with their recorded speech counterparts: TTS samples may sometimes show (slightly) more or (slightly) less durational variability of C and V segments, and there does not seem to be a general tendency in our data in this respect. We claim that traditional rhythm correlates are global measures that account for average durational variability of speech samples. They are good at giving a general overview of the rhythmic/timing properties of speech, but they are not able to detect specific local a-rhythmical phenomena found in TTS output, which are probably rooted in the prosodic pattern of the sentence. In the future, it would be desirable to develop acoustic indices that are able to detect such phenomena.
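The correlation step described in this abstract can be illustrated with a short sketch. The authors report working in R; the Python below is not their script. It assumes that each recorded segment has been paired with the duration of the same segment in the TTS version, and the function name `pearson_r` is hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson's correlation coefficient between two paired sequences,
    e.g. recorded vs. synthesized durations of the same segments."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two paired sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products of deviations (unnormalised covariance).
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Square roots of the sums of squared deviations.
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (sd_x * sd_y)
```

A high coefficient computed this way indicates that the two sets of durations covary across the whole sample. It does not show whether individual stretches are locally well-timed, which fits the abstract's point that global measures can miss local a-rhythmic phenomena.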
This study investigates the speech rhythm of Cantonese, Beijing Mandarin, Cantonese-accented English and Mandarin-accented English using acoustic rhythmic measures. They were compared with four languages in the BonnTempo corpus: German and English (stress-timed) and French and Italian (syllable-timed). Six Cantonese and six Beijing Mandarin native speakers were recorded reading the North Wind and the Sun story with a normal speech rate, telling the story semi-spontaneously and reading the English version of the story. Both raw and normalised rhythmic measures were calculated using vocalic, consonantal and syllabic durations (∆C, ∆V, ∆S, %V, VarcoC, VarcoV, VarcoS, rPVI_C, rPVI_S, nPVI_V, nPVI_S). Results confirm the syllable-timing impression of Cantonese and Mandarin. Data of the two foreign English accents poses a challenge to the rhythmic measures because the two accents are syllable-timed impressionistically but were classified as stress-timed by some of the rhythmic measures (∆C, rPVI_C, nPVI_V, ∆S, VarcoS, rPVI_S and nPVI_S). VarcoC and %V give the best classification of speech rhythm in this study.
What determines duration-based rhythm measures: text or speaker?
Laboratory Phonology, 2013
Differences in rhythm between languages have often been attributed to differences in phonological properties such as syllable structure. This paper uses quantitative analyses to determine whether and how popular duration-based rhythm measures depend on the phonological structure of a language. Native speakers of five languages read a large corpus of comparable texts (approximately 371,000 syllables in total). Phonological properties of each language were specified as 11 variables, computed from the phonetic transcriptions. These variables were compared against published rhythm measures that captured variation in duration of consonantal and vocalic intervals. While the text-based measures discriminated well between languages, the values of rhythm measures overlapped substantially, showing that the languages are more alike in acoustic implementation than in their phonological description. Multilevel models demonstrated that the mapping between phonological properties and acoustics is ...
Rhythmic variability between speakers: Articulatory, prosodic, and linguistic factors
The Journal of the Acoustical Society of America, 2015
Between-speaker variability of acoustically measurable speech rhythm [%V, ΔV(ln), ΔC(ln), and Δpeak(ln)] was investigated when within-speaker variability of (a) articulation rate and (b) linguistic structural characteristics was introduced. To study (a), 12 speakers of Standard German read seven lexically identical sentences under five different intended tempo conditions (very slow, slow, normal, fast, very fast). To study (b), 16 speakers of Zurich Swiss German produced 16 spontaneous utterances each (256 in total) for which transcripts were made and then read by all speakers (4096 sentences; 16 speakers × 256 sentences). Between-speaker variability was tested using analysis of variance with repeated measures on within-speaker factors. Results revealed strong and consistent between-speaker variability while within-speaker variability as a function of articulation rate and linguistic characteristics was typically not significant. It was concluded that between-speaker variability of a...
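The abstract does not define the "(ln)" measures. The sketch below assumes that ΔV(ln) and ΔC(ln) denote the standard deviation of natural-log-transformed interval durations, which is a common way of reducing the influence of articulation rate. The function name `delta_ln` and the choice of population standard deviation are illustrative assumptions, not the authors' implementation.

```python
from math import log
from statistics import pstdev


def delta_ln(durations):
    """Standard deviation of natural-log-transformed durations.
    Assumption: this is what the '(ln)' variants of Delta refer to."""
    return pstdev([log(d) for d in durations])


# Hypothetical durations in seconds: the same utterance at two tempi,
# where the slower version has every interval exactly twice as long.
normal = [0.1, 0.2, 0.4]
slow = [2 * d for d in normal]
same_variability = abs(delta_ln(normal) - delta_ln(slow)) < 1e-9
```

Under this assumption, multiplying every duration by a constant only shifts the log values and leaves their spread unchanged, so `same_variability` is true for a uniform tempo change. Real tempo changes are rarely uniform across segments, so the measure would be only approximately rate-independent in practice.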
The dynamic dimension of the global speech-rhythm attributes
Annual Conference of the International Speech Communication Association, 2009
Recent years have shown that certain global attributes of speech rhythm can be captured quite successfully with respect to consonantal and vocalic intervals in spoken texts. One of the problems of this approach lies in complex syllabic structures. Unless we make an a-priori phonological decision, sonorous consonants may contribute to either the vocalic or the consonantal part of the speech signal in post-initial and prefinal positions of syllabic onsets and codas. A procedure is offered to avoid phonological dilemmas together with tedious manual work. The method is tested on continuous Czech and English texts read out by several professionals.
The rhythm of text and the rhythm of utterances: from metrics to models
Tenth Annual Conference of the International Speech …, 2009
The typological classification of languages as stress-timed, syllable-timed and mora-timed did not stand up to empirical investigation, which found little or no evidence for the different types of isochrony that had been assumed to be the basis for the classification. In recent years, there has been a renewal of interest with the development of empirical metrics for measuring rhythm. In this paper it is shown that some of these metrics are more sensitive to the rhythm of the text than to the rhythm of the utterance itself. While a number of recent proposals have been made for improving these metrics, it is proposed that what is needed is more detailed studies of large corpora in order to develop more sophisticated models of the way in which prosodic structure is realised in different languages. New data on British English is presented using the Aix-Marsec corpus.