Perceptually Motivated Parameters for Automatic Prosodic Annotation
2012
Abstract
This contribution presents an approach to automatic prosodic annotation which emphasizes the linguistic motivation and perceptual relevance of the features used for classifying the prosodic categories. The analyses and experiments presented here were conducted on a 2.5-hour German news-like corpus which had been manually annotated using GToBI(S) (Mayer, 1995). GToBI(S) is an adaptation of American English ToBI (Silverman et al., 1992; Beckman and Ayers, 1994) to German.
Related papers
Vocale - A Semi-Automatic Annotation Tool for Prosodic Research
Large annotated speech corpora are a critical component of research in prosody. The classification of languages according to their speech rhythm, for example, requires a great number of annotated sentences by different speakers in different languages. We have developed Vocale, a tool for the semiautomatic annotation of vocalic and consonantal parts of speech because in recent models these units have been identified as reliable acoustic correlates of speech rhythm. Vocale is based on relative entropy and uses various additional classifiers such as energy and length for the annotation of vowels and consonants. It runs using Praat speech analysis facilities and gives a Praat label file as an output. Vocale is open source software and is available to the scientific community under http://www.ime.usp.br/#tycho/tipal/prosody/vocale/.
ProPOSEC: A Prosody and PoS Annotated Spoken English Corpus
2010
We have previously reported on ProPOSEL, a purpose-built Prosody and PoS English Lexicon compatible with the Python Natural Language ToolKit. ProPOSEC is a new corpus research resource built using this lexicon, intended for distribution with the Aix-MARSEC dataset. ProPOSEC comprises multi-level parallel annotations, juxtaposing prosodic and syntactic information from different versions of the Spoken English Corpus, with canonical dictionary forms, in a query format optimized for Perl, Python, and text processing programs. The order and content of fields in the text file is as follows: (1) Aix-MARSEC file number; (2) word; (3) LOB PoS-tag; (4) C5 PoS-tag; (5) Aix SAM-PA phonetic transcription; (6) SAM-PA phonetic transcription from ProPOSEL; (7) syllable count; (8) lexical stress pattern; (9) default content or function word tag; (10) DISC stressed and syllabified phonetic transcription; (11) alternative DISC representation, incorporating lexical stress pattern; (12) nested arrays of phonemes and tonic stress marks from Aix. As an experimental dataset, ProPOSEC can be used to study correlations between these annotation tiers, where significant findings are then expressed as additional features for phrasing models integral to Text-to-Speech and Speech Recognition. As a training set, ProPOSEC can be used for machine learning tasks in Information Retrieval and Speech Understanding systems.
PROSOTRAN: a tool to annotate prosodically non-standard data
Assigning a prosodic transcription that encompasses all prosodic phenomena (intonation, accentuation and phrasing) is difficult mainly because: (i) encoding all the prosodic phenomena usually presupposes knowledge of the language being transcribed; and (ii) a representation of the various phenomena cannot be achieved without taking into account the three prosodic parameters. In this paper, we present a tool, PROSOTRAN, which automatically assigns to each utterance a multi-tiered transcription that symbolically represents how the three prosodic parameters (F0, duration and energy) vary over time. Assigning labels to each syllable avoids segmenting the signal into linguistic units that are difficult to define when the language being transcribed is not known.
Criteria for labelling prosodic aspects of English speech
1992
We report a set of labelling criteria which have been developed to label prosodic events in clear, continuous speech, and propose a scheme whereby this information can be transcribed in a machine-readable format. We have chosen to annotate prosody in a syllabic domain which is synchronised with a phonemic segmentation. A procedural definition of syllables based on the grouping of phones is presented. The criteria for hand labelling the prominence of each syllable, tone-unit boundaries and the pitch movement associated with each accented syllable are described. Work to automate this process is presented and experimental results evaluating its performance are included.
Optimizing the automatic functional annotation of English intonation
One of the fundamental aims of prosodic analysis is to provide a reliable means of extracting functional information (what prosody contributes to meaning) directly from prosodic form. It has been argued that an explicit model of the mapping from prosodic function to prosodic form could provide an objective way of approaching this task. In this presentation we look specifically at some of the problems of optimizing this mapping in order to extract the functional information automatically from the formal representation, hence ultimately directly from the acoustic data.
Automatic classification of prosodically marked phrase boundaries in German
1994
A large corpus has been created automatically and read by 100 speakers. Phrase boundaries were labeled in the sentences automatically during sentence generation. Perception experiments on a subset of 500 utterances showed a high agreement between the automatically generated boundary markers and the ones perceived by listeners. Gaussian distribution and polynomial classifiers were trained on a set of prosodic features computed from the speech signal using the automatically generated boundary markers. Comparing the classification results with the judgments of the listeners yielded a recognition rate of 87%. A combination with stochastic language models improved the recognition rate to 90%. We found that the pause and the durational features are most important for the classification, but that the influence of F0 is not negligible.
Automatic annotation and classification of phrase accents in spontaneous speech
6th European Conference on Speech Communication and Technology (Eurospeech 1999)
In recent years, we have been working on the automatic classification of boundaries and accents in the German VERBMOBIL (VM) project (human-human communication, appointment scheduling dialogues). A sub-corpus was annotated manually with prosodic boundary and accent labels, and neural networks (NN) trained with a large set of prosodic features were used for automatic classification. The classification of boundaries could be improved markedly with a combination of the NN with a language model (LM) that was trained with manually annotated syntactic-prosodic boundary labels in a much larger sub-corpus. Here we show how a combination of NN with LM along similar lines can be used for an improvement of accent classification as well. For the training of the LM, accents are annotated automatically in the transliteration with the help of a rule-based system that uses part of speech (POS) as well as other linguistic (phonological) information.
Experiments on automatic prosodic labeling
This paper presents results from experiments on automatic prosodic labeling. Using the WEKA machine learning software [1], classifiers were trained to determine for each syllable in a speech database of a male speaker its pitch accent and its boundary tone. Pitch accents and boundaries are according to the GToBI(S) dialect, with slight modifications. Classification was based on 35 attributes involving PaIntE F0 parametrization [2] and normalized phone durations, but also some phonological information as well as higher-linguistic information. Several classification algorithms yield results of approx. 78% accuracy on the word level for pitch accents, and approx. 88% accuracy on the word level for phrase boundaries, which compare very well to results of other studies. The classifiers generalize to similar data of a female speaker in that they perform equally well as classifiers trained directly on the female data.
Index Terms: perception of prosody, prosodic labeling, F0 parametrization
Footnote 1: The default settings of the IBk learning scheme implemented in WEKA set the k parameter to 1. The k parameter determines the number of neighbours considered in classification of new instances, and with k=1 this learning scheme is identical to the separate IB1 learning scheme in WEKA. Therefore, k=30 was used instead.
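The role of the k parameter described for the IBk learning scheme can be illustrated with a minimal k-nearest-neighbour sketch. This is not WEKA's implementation: the function name `knn_predict`, the two-dimensional toy features, and the labels are all hypothetical and chosen only to show how a single noisy neighbour decides the outcome at k=1 but is outvoted at a larger k.

```python
from collections import Counter
import math


def knn_predict(train, query, k):
    """Return the majority label among the k training points nearest to query.

    train is a list of (feature_vector, label) pairs; distance is Euclidean.
    """
    neighbours = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two made-up features per syllable (not the 35 attributes
# used in the paper), with one noisy label inside the "accented" cluster.
train = [
    ((0.90, 0.80), "accented"),
    ((0.80, 0.90), "accented"),
    ((0.85, 0.70), "accented"),
    ((0.20, 0.10), "unaccented"),
    ((0.10, 0.20), "unaccented"),
    ((0.75, 0.75), "unaccented"),  # noisy instance
]
query = (0.76, 0.76)

# With k=1 the single nearest (noisy) neighbour decides the label;
# with k=3 the two accented neighbours outvote it.
print(knn_predict(train, query, k=1))
print(knn_predict(train, query, k=3))
```

In this toy setting a larger k smooths over isolated mislabeled instances, which is one common reason for choosing k greater than 1; the abstract itself does not state why k=30 in particular was selected.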
Prosodic Classification of Discourse Markers
2015
The first contribution of this study is the description of the prosodic behavior of discourse markers present in two speech corpora of European Portuguese (EP) in different domains (university lectures, and map-task dialogues). The second contribution is a multiclass classification to verify, given their prosodic features, which words in both corpora are classified as discourse markers, which are disfluencies, and which correspond to words that are neither markers nor disfluencies (chunks). Our goal is to automatically predict discourse markers and include them in rich transcripts, along with other structural metadata events (e.g., disfluencies and punctuation marks) that are already encompassed in the language models of our in-house speech recognizer. Results show that the automatic classification of discourse markers is better for the lectures corpus (87%) than for the dialogue corpus (84%). Nonetheless, in both corpora, discourse markers are more easily confused with chunks than with disfluencies.
Developing an automatic functional annotation system for British English intonation
One of the fundamental aims of prosodic analysis is to provide a reliable means of extracting functional information (what prosody contributes to meaning) directly from prosodic form (i.e. what prosody is, in this case intonation). This paper addresses the development of an automatic functional annotation system for British English. It is based on the study of a large corpus of British English and a procedure of analysis by synthesis, which makes it possible to test and enrich different models of English intonation on the one hand and to work towards an automatic version of the annotation process on the other.

References (12)
- Beckman, M. E. and Ayers, G. M. (1994). Guidelines for ToBI labelling, version 2.0.
- Braunschweiler, N. (2006). The Prosodizer - automatic prosodic annotations of speech synthesis databases. In Proceedings of Speech Prosody 2006 (Dresden).
- Hasegawa-Johnson, M., Chen, K., Cole, J., Borys, S., Kim, S.-S., Cohen, A., Zhang, T., Choi, J.-Y., Kim, H., and Yoon, T. (2005). Simultaneous recognition of words and prosody in the Boston University Radio Speech Corpus. Speech Communication, 46(3-4), 418-439.
- House, D. (1996). Differential perception of tonal contours through the syllable. In Proceedings of the International Conference on Spoken Language Processing (Philadelphia, PA), volume 1, pages 2048-2051.
- Jilka, M. and Möbius, B. (2007). The influence of vowel quality features on peak alignment. In Proceedings of Interspeech 2007 (Antwerpen), pages 2621-2624.
- Mayer, J. (1995). Transcription of German intonation - the Stuttgart system. Technical report, Institute of Natural Language Processing, University of Stuttgart.
- Möhler, G. and Conkie, A. (1998). Parametric modeling of intonation using vector quantization. In Proceedings of the Third International Workshop on Speech Synthesis (Jenolan Caves, Australia), pages 311-316.
- Rosenberg, A. (2009). Automatic Detection and Classification of Prosodic Events. Ph.D. thesis, Columbia University.
- Schweitzer, A. (2011). Production and Perception of Prosodic Events - Evidence from Corpus-based Experiments. Doctoral dissertation, Universität Stuttgart.
- Silverman, K., Beckman, M., Pitrelli, J., Ostendorf, M., Wightman, C., Price, P., Pierrehumbert, J., and Hirschberg, J. (1992). ToBI: a standard for labeling English prosody. In Proceedings of the International Conference on Spoken Language Processing (ICSLP, Banff), pages 867-870.
- Sridhar, V. K. R., Bangalore, S., and Narayanan, S. (2008). Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE Transactions on Audio, Speech, and Language Processing, 16(4).
- Zeißler, V., Adelhardt, J., Batliner, A., Frank, C., Nöth, E., Shi, R. P., and Niemann, H. (2006). The Prosody Module, pages 139-152. Springer, Berlin.
Related papers
Linguistic Annotation of Two Prosodic Databases
2007
Two prosodic databases were annotated with linguistic information using SGML (Standard Generalized Markup Language), one database of American English and one of Modern Standard German. Only information that might have prosodic correlates was annotated. Phonetic and morphological information was supplied by automatic tools and then hand corrected. Semantic and pragmatic information was inserted by hand. The SGML tagset is essentially the same for both languages.
A Framework for Language-Independent Analysis and Prosodic Feature Annotation of Text Corpora
Text, Speech and …, 2008
Concept-to-Speech systems include Natural Language Generators that produce linguistically enriched text descriptions which can lead to significantly improved quality of speech synthesis. There are cases, however, where either the generator modules produce pieces of non-analyzed, non-annotated plain text, or such modules are not available at all. Moreover, the language analysis is restricted by the usually limited domain coverage of the generator due to its embedded grammar. This work reports on a language-independent framework basis, linguistic resources and language analysis procedures (word/sentence identification, part-of-speech, prosodic feature annotation) for text annotation/processing for plain or enriched text corpora. It aims to produce an automated XML-annotated enriched prosodic markup for English and Greek texts, for improved synthetic speech. The markup includes information for both training the synthesizer and for actual input for synthesising. Depending on the domain and target, different methods may be used for automatic classification of entities (words, phrases, sentences) to one or more preset categories such as "emphatic event", "new/old information", "second argument to verb", "proper noun phrase", etc. The prosodic features are classified according to the analysis of the speech-specific characteristics for their role in prosody modelling and passed through to the synthesizer via an extended SOLE-ML description. Evaluation results show that using selectable hybrid methods for part-of-speech tagging high accuracy is achieved. Annotation of a large generated text corpus containing 50% enriched text and 50% canned plain text produces a fully annotated uniform SOLE-ML output containing all prosodic features found in the initial enriched source. Furthermore, additional automatically-derived prosodic feature annotation and speech synthesis related values are assigned, such as word-placement in sentences and phrases, previous and next word entity relations, emphatic phrases containing proper nouns, and more.
Automatic Annotation of Speech Corpora for Prosodic Prominence
2004
This paper presents a study on the automatic detection of prosodic prominence in continuous speech, with particular reference to American English, but with good prospects of application to other languages. Perceptual prosodic prominence is supported by two different prosodic features: pitch accent and stress. Pitch accent is acoustically connected with fundamental frequency (F0) movements and overall syllable energy, whereas stress exhibits a strong correlation with syllable nuclei duration and mid-to-high-frequency emphasis. This paper shows that a careful measurement of these acoustic parameters, as well as the identification of their connection to prosodic phenomena, makes it possible to build automatic systems capable of identifying prominent syllables in utterances with performance comparable with the inter-human agreement reported in the literature, without using any kind of information apart from the acoustic parameters derived directly from speech waveforms.
An Automatic Prosody Tagger for Spontaneous Speech
2016
Speech prosody is known to be central in advanced communication technologies. However, despite the advances of theoretical studies in speech prosody, so far, no large scale prosody annotated resources that would facilitate empirical research and the development of empirical computational approaches are available. This is to a large extent due to the fact that current common prosody annotation conventions offer a descriptive framework of intonation contours and phrasing based on labels. This makes it difficult to reach a satisfactory inter-annotator agreement during the annotation of gold standard annotations and, subsequently, to create consistent large scale annotations. To address this problem, we present an annotation schema for prominence and boundary labeling of prosodic phrases based upon acoustic parameters and a tagger for prosody annotation at the prosodic phrase level. Evaluation proves that inter-annotator agreement reaches satisfactory values, from 0.60 to 0.80 Cohen's kappa, while the prosody tagger achieves acceptable recall and f-measure figures for five spontaneous samples used in the evaluation of monologue and dialogue formats in English and Spanish. The work presented in this paper is a first step towards a semi-automatic acquisition of large corpora for empirical prosodic analysis.
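The inter-annotator agreement figures above are reported as Cohen's kappa, which corrects observed agreement for the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e). The following is a minimal sketch of that standard formula; the function name `cohens_kappa` and the two toy label sequences are hypothetical and are not taken from the paper's data.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both annotators.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label proportions, summed over labels.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    p_e = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical prominence labels ("P" = prominent, "N" = not prominent) for ten words.
annotator_a = ["P", "P", "N", "N", "P", "N", "P", "N", "N", "P"]
annotator_b = ["P", "P", "N", "N", "P", "N", "N", "N", "P", "P"]

# 8 of 10 labels agree (p_o = 0.8) and chance agreement is 0.5, so kappa is about 0.6.
print(round(cohens_kappa(annotator_a, annotator_b), 3))
```

In this toy case the raw agreement of 80% corresponds to a kappa at the lower end of the 0.60 to 0.80 range quoted in the abstract, which shows how much of the raw agreement the chance correction removes.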
Improving the phonetic annotation by means of prosodic phrasing
Fifth European …, 1997
It was established that the performance of our annotation system [8] is affected by the length of the utterances: the error rate, the CPU load and the memory requirements tend to increase as the utterances get longer. In this contribution the speech signal is first segmented into speech, pauses and noise (breaths, clicks, ...) and subsequently split into signal phrases prior to the annotation. Experiments on 3 different databases (3 languages) demonstrate that this strategy yields a significant improvement of the annotation accuracy.
Automatic ToBI prediction and alignment to speed manual labeling of prosody
Speech Communication, 2001
Tagging of corpora for useful linguistic categories can be a time-consuming process, especially with linguistic categories for which annotation standards are relatively new, such as discourse segment boundaries or the intonational events marked in the Tones and Break Indices (ToBI) system for American English. A ToBI prosodic labeling of speech typically takes even experienced labelers from 100 to 200 times real time. An experiment was conducted to determine (1) whether manual correction of automatically assigned ToBI labels would speed labeling, and (2) whether default labels introduced any bias in label assignment. A large speech corpus of one female speaker reading several types of texts was automatically assigned default labels. Default accent placement and phrase boundary location were predicted from text using machine learning techniques. The most common ToBI labels were assigned to these locations for default tones and break type. Predicted pitch accents were automatically aligned to the mid-point of the word, while breaks and edge tones were aligned to the end of the phrase-final word. The corpus was then labeled by a group of five trained transcribers working over a period of nine months. Half of each set of recordings was labeled in the standard fashion without default labels, and the other half was presented with preassigned default labels for labelers to correct. Results indicate that labeling from defaults was generally faster than standard labeling, and that defaults had relatively little impact on label assignment.
Annotation of German Intonation: DIMA Compared with other Annotation Systems
2019
Annotating intonation is a considerable challenge, since not only intonational form but also its meaning are complex in terms of their internal make-up and contextual variation. Since the advent of the autosegmental-metrical approach to intonation in the 1980s, the annotation of intonation has continued to be a matter of debate, witnessed by the current discussion around the proposed International Prosodic Alphabet (IPrA), with a reported need for a more surface-related annotation that serves as a basis for phonological categorisation. The DIMA system accounts for such a level by providing a phonetically informed annotation of an intonation contour that nevertheless reflects its phonological core. DIMA is a consensus system for the annotation of German intonation that analyses intonation at three distinct levels: phrasing, tones and prominences. The present paper compares DIMA with other annotation systems such as GToBI, ToGI, IViE, KIM, RaP, and IPrA.