Computation of L2 Speech Rhythm based on Duration and F0

Comparing native and non-native speech rhythm using acoustic rhythmic measures: Cantonese, Beijing Mandarin and English

Proc. of Speech Prosody, Campinas, Brazil, 2008

This study investigates the speech rhythm of Cantonese, Beijing Mandarin, Cantonese-accented English and Mandarin-accented English using acoustic rhythmic measures. They were compared with four languages in the BonnTempo corpus: German and English (stress-timed) and French and Italian (syllable-timed). Six Cantonese and six Beijing Mandarin native speakers were recorded reading the North Wind and the Sun story at a normal speech rate, telling the story semi-spontaneously and reading the English version of the story. Both raw and normalised rhythmic measures were calculated using vocalic, consonantal and syllabic durations (ΔC, ΔV, ΔS, %V, VarcoC, VarcoV, VarcoS, rPVI_C, rPVI_S, nPVI_V, nPVI_S). Results confirm the syllable-timing impression of Cantonese and Mandarin. Data from the two foreign English accents pose a challenge to the rhythmic measures because the two accents are impressionistically syllable-timed but were classified as stress-timed by some of the rhythmic measures (ΔC, rPVI_C, nPVI_V, ΔS, VarcoS, rPVI_S and nPVI_S). VarcoC and %V give the best classification of speech rhythm in this study.

Mok, P. & Dellwo, V. (2008). Comparing native and non-native speech rhythm using acoustic rhythmic measures: Cantonese, Beijing Mandarin and English. In Proceedings of the 4th Speech Prosody, Campinas, Brazil, 423-426.

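The duration-based metrics named in the abstract above (deltas, %V, Varcos, PVIs) follow standard published formulas. A minimal sketch of their computation, assuming we already have lists of consonantal (C) and vocalic (V) interval durations in seconds; the interval values are illustrative, not from the paper's corpus:

```python
# Common duration-based rhythm metrics over interval durations (seconds).
# Input durations below are made up for illustration only.
import statistics

def delta(durations):
    """Delta (e.g. deltaC, deltaV): standard deviation of interval durations."""
    return statistics.pstdev(durations)

def percent_v(v_durations, c_durations):
    """%V: percentage of total utterance duration that is vocalic."""
    total = sum(v_durations) + sum(c_durations)
    return 100.0 * sum(v_durations) / total

def varco(durations):
    """Varco: rate-normalised delta (standard deviation / mean, x100)."""
    return 100.0 * statistics.pstdev(durations) / statistics.mean(durations)

def rpvi(durations):
    """Raw Pairwise Variability Index: mean absolute difference
    between successive intervals."""
    return statistics.mean(abs(a - b) for a, b in zip(durations, durations[1:]))

def npvi(durations):
    """Normalised PVI: successive differences scaled by their local mean, x100."""
    return 100.0 * statistics.mean(
        abs(a - b) / ((a + b) / 2) for a, b in zip(durations, durations[1:]))

# Illustrative vocalic and consonantal interval durations (seconds)
v = [0.08, 0.12, 0.05, 0.15, 0.09]
c = [0.06, 0.10, 0.07, 0.11]

print(round(percent_v(v, c), 1))  # %V
print(round(varco(v), 1))         # VarcoV
print(round(npvi(v), 1))          # nPVI_V
```

The raw measures (delta, rPVI) are sensitive to speech rate, which is why the normalised variants (Varco, nPVI) divide by local or global means before comparing languages.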

Computation of L2 speech rhythm based on duration and fundamental frequency

2017

Rhythmic characteristics of speech vary between native and non-native speakers. Studies comparing the rhythmic properties of L1 and L2 speech based on rhythm metrics have shown that this relationship is far from straightforward. It seems evident that the difference between native and non-native speech involves a complex interaction of a variety of rhythmic cues (duration, F0 and intensity). In this study we extended the durational domain with F0 and tested whether metrics combining duration and F0 (henceforth combined measures) could better account for the rhythmic differences between L1 and L2 speech. To test this, 5 native Mandarin speakers and 5 Italian learners of Mandarin recorded The North Wind and the Sun in Mandarin. In addition, each Italian speaker also recorded the same text in Italian. Each sentence in L1-L2 Chinese and in L1 Italian was segmented into syllables. We calculated duration- and F0-based metrics (Δsyllable duration/F0; r-PVI syllable duration/F0; Varco syllabl...

Rhythm measures and dimensions of durational variation in speech

Journal of The Acoustical Society of America, 2011

Patterns of durational variation were examined by applying 15 previously published rhythm measures to a large corpus of speech from five languages. In order to achieve consistent segmentation across all languages, an automatic speech recognition system was developed to divide the waveforms into consonantal and vocalic regions. The resulting duration measurements rest strictly on acoustic criteria. Machine classification showed that rhythm measures could separate languages at rates above chance. Within-language variability in rhythm measures, however, was large and comparable to that between languages. Therefore, different languages could not be identified reliably from single paragraphs. In experiments separating pairs of languages, a rhythm measure that was relatively successful at separating one pair often performed very poorly on another pair: there was no broadly successful rhythm measure. Separation of all five languages at once required a combination of three rhythm measures. ...

APPLYING DURATIONAL METRICS TO RECORDED SPEECH VS. TTS – EVIDENCE FROM 6 LANGUAGES

The present study is motivated by the observation that TTS samples of concatenative synthesis may often sound a-rhythmic to the human ear. For this reason, we aimed at comparing samples of speech from real speakers (studio recordings) vs. synthetic samples. We followed a multi-step procedure in order to obtain comparable samples. First, we selected 2000 recorded sentences for each of English, French, German, Spanish, Italian and Japanese from bigger speech corpora (1 speaker per language). Then, we further divided them into (a) 1500 sentences to be included in the speech-base of the TTS system, and (b) 500 sentences for testing. The former 1500 sentences were used to create TTS voices (for each of the six languages), with which we synthesized the latter 500 sentences. We thereby obtained 500 sentences in two flavours, i.e. as recorded samples and as TTS samples. These were all automatically segmented starting from the text, which was itself phonetically transcribed by our in-house software. The result of the transcription and segmentation was then converted to C and V intervals by an ad-hoc script and imported into R for analysis. Pearson's correlation coefficient was calculated for each segment of recorded speech vs. the corresponding TTS sample. Results show that correlation is high overall (ranging from 0.82 to 0.93 for the six speaker-TTS pairs) and prove that concatenative synthesis is able to reproduce global durational characteristics of speech. Values for the most popular rhythm metrics (deltas, %V, PVIs, CCIs) were also computed and plotted to charts to illustrate durational variability for the samples. Results for recorded samples of the six languages studied here reflect previous results reported in the literature: English and German samples tend to show greater durational variability than French, Spanish, Italian and Japanese samples, and they tend to have a lower vocalic percentage. Results for TTS samples show the very same trend.
In fact, the mean values for each TTS system tend to sit very close to those of its recorded speech counterpart. It is also notable that no general trend connects TTS samples with their recorded speech counterparts: TTS samples may sometimes show (slightly) more or (slightly) less durational variability of C and V segments, and there does not seem to be a general tendency in our data in this respect. We claim that traditional rhythm correlates are global measures that account for average durational variability of speech samples. They are good at giving a general overview of the rhythmic/timing properties of speech, but they are not able to detect the specific local a-rhythmical phenomena found in TTS output, which are probably rooted in the prosodic pattern of the sentence. In the future, it would be desirable to develop acoustic indices that are able to detect such phenomena.
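The core comparison step described above, correlating matched segment durations from recorded and synthesized renditions of the same sentences, can be sketched in a few lines. The duration values here are illustrative placeholders, not data from the study:

```python
# Pearson correlation between matched segment durations (recorded vs. TTS).
# Values are illustrative; the study reports r between 0.82 and 0.93.
import numpy as np

recorded = np.array([0.08, 0.12, 0.05, 0.15, 0.09, 0.11])  # seconds
tts      = np.array([0.07, 0.13, 0.06, 0.14, 0.08, 0.12])  # seconds

r = float(np.corrcoef(recorded, tts)[0, 1])
print(round(r, 3))
```

A high r here means only that the TTS reproduces the global pattern of segment durations; as the abstract notes, it does not rule out local a-rhythmic deviations.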

Rhythmic variability between some Asian Languages: Results from an automatic analysis of temporal characteristics

The rhythmic organization of speech can vary between languages. In the present research we studied rhythmic variability between Mandarin, Cantonese and Thai using automatically retrieved prosodic temporal characteristics from read speech. We measured the variability of intervals between amplitude peaks in the amplitude envelope (<10 Hz) and the durational characteristics of intervals with and without glottal activity (voiced and unvoiced intervals) in speech. Results for between language comparisons revealed significant differences between languages in both amplitude peak interval variability and voiced-voiceless interval durational characteristics. Results are discussed in connection with language specific phonotactic/phonological properties and hypotheses about the perceptual significance of the acoustic measurements in terms of speech rhythm.
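The amplitude-envelope measurement described in the abstract can be approximated as follows. This is a rough numpy-only sketch of the general idea (rectify the signal, low-pass it below 10 Hz, find envelope peaks, and measure inter-peak interval variability); the smoothing method, thresholds, and window sizes are illustrative assumptions, not the study's actual parameters:

```python
# Sketch: variability of intervals between amplitude-envelope peaks (<10 Hz).
# All parameter choices here are illustrative assumptions.
import numpy as np

def envelope(signal, sr, cutoff_hz=10.0):
    """Crude low-frequency envelope: rectify, then moving-average low-pass
    with a window of roughly 1/cutoff_hz seconds."""
    rect = np.abs(signal)
    win = max(1, int(sr / cutoff_hz))
    kernel = np.ones(win) / win
    return np.convolve(rect, kernel, mode="same")

def peak_interval_variability(env, sr):
    """Standard deviation (s) of intervals between local maxima of the envelope."""
    peaks = np.where((env[1:-1] > env[:-2]) & (env[1:-1] > env[2:]))[0] + 1
    if len(peaks) < 3:
        return 0.0
    intervals = np.diff(peaks) / sr
    return float(np.std(intervals))

# Synthetic test signal: noise with a 4 Hz amplitude modulation,
# mimicking a syllable-rate rhythm
sr = 16000
t = np.arange(0, 2.0, 1.0 / sr)
rng = np.random.default_rng(0)
sig = (0.5 + 0.5 * np.sin(2 * np.pi * 4 * t)) * rng.standard_normal(t.size)

env = envelope(sig, sr)
print(peak_interval_variability(env, sr))
```

Lower variability of inter-peak intervals corresponds to more regular amplitude alternation, which is the intuition the abstract links to perceived rhythm.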

Towards a perceptual model of speech rhythm: Integrating the influence of f0 on perceived duration

Proceedings of Interspeech 2014, 2014

Previous accounts of speech rhythm focus mainly on duration. For example, the normalised Pairwise Variability Index for vocalic intervals (nPVI-V) quantifies relative duration differences between successive vocalic intervals. Prototypical syllable-timing is characterised by small differences in duration, prototypical stress-timing by large differences. However, differences in f0 between vocalic intervals are thought to influence the perception of duration. This paper (1) quantifies the influence of differences in f0 on perceived duration in a perception experiment, and (2) suggests a modified PVI (nPVI-V(dur*f0)) that takes account of this influence. The new nPVI-V(dur*f0) is then applied to a speech corpus of (stress-timed) British English and (syllable-timed) Indian English. The results are compared to the application of the old nPVI-V, which takes into account duration only, to the same data set.
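The abstract names the modified metric nPVI-V(dur*f0) but does not give its formula. As a purely hypothetical illustration of the idea, one could scale each vocalic duration by an f0-dependent perceived-duration factor before applying the standard nPVI; the linear weighting function and its strength below are assumptions, not the authors' model:

```python
# Standard nPVI plus a TOY f0-weighted variant. The perceived-duration
# model (linear in relative f0, weight=0.1) is a made-up illustration.

def npvi(values):
    """Standard normalised PVI over successive intervals."""
    pairs = list(zip(values, values[1:]))
    return 100.0 * sum(abs(a - b) / ((a + b) / 2) for a, b in pairs) / len(pairs)

def perceived_durations(durations, f0s, weight=0.1):
    """Toy model: intervals with higher-than-average f0 'sound' longer.
    The functional form and weight are illustrative assumptions."""
    mean_f0 = sum(f0s) / len(f0s)
    return [d * (1.0 + weight * (f0 - mean_f0) / mean_f0)
            for d, f0 in zip(durations, f0s)]

dur = [0.08, 0.12, 0.05, 0.15]      # vocalic durations (s), illustrative
f0 = [220.0, 180.0, 240.0, 200.0]   # mean f0 per interval (Hz), illustrative

print(npvi(dur))                               # duration-only nPVI-V
print(npvi(perceived_durations(dur, f0)))      # toy combined variant
```

The point of such a combined measure is that two accents with identical durational variability could still differ once f0-driven perceived duration is factored in.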

The usefulness of metrics in the quantification of speech rhythm

Journal of Phonetics, 2012

The performance of the rhythm metrics ΔC, %V, PVIs and Varcos, said to quantify rhythm class distinctions, was tested using English, German, Greek, Italian, Korean and Spanish. Eight participants per language produced speech using three elicitation methods: spontaneous speech, story reading, and reading a set of sentences divided into "uncontrolled" sentences from original works in each language, and sentences devised to maximize or minimize syllable structure complexity (the "stress-timed" and "syllable-timed" sets respectively). Rhythm classifications based on pooled data were inconsistent across metrics, while cross-linguistic differences in scores were often statistically non-significant even for comparisons between prototypical languages like English and Spanish. Metrics showed substantial inter-speaker variation and proved very sensitive to elicitation method and syllable complexity; the size of both effects was large and often comparable to that of language. These results suggest that any cross-linguistic differences captured by metrics are not robust; metric scores range substantially within a language and are readily affected by a variety of methodological decisions, making cross-linguistic comparisons and rhythmic classifications based on metrics unsafe at best.