Automatic labeling of speech synthesis corpora

Automatic Segment Alignment for Concatenative Speech Synthesis in Portuguese

1998

Concatenative text-to-speech synthesizers join pre-recorded segments of speech data to produce high-quality output speech. The synthesizer has to find the best segment to concatenate from an inventory of speech material. To do that, the inventory should be built from a correctly transcribed and time-aligned speech database. This paper describes the construction of an automatic alignment tool based on Hidden Markov Models, using very small training and test sets.
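
As a rough illustration of the alignment idea (not the paper's actual tool), the sketch below performs Viterbi forced alignment of a known phone sequence against frame-level features, assuming one single-Gaussian state per phone with a self-loop; the feature representation and all names are illustrative assumptions.

```python
import numpy as np

def log_gaussian(frames, mean, var):
    """Frame-wise log-likelihood under a diagonal-covariance Gaussian."""
    return -0.5 * np.sum(
        np.log(2 * np.pi * var) + (frames - mean) ** 2 / var, axis=1
    )

def force_align(frames, phone_seq, models):
    """Viterbi alignment of frames (T x D) to the known phone sequence.

    `models` maps each phone label to a (mean, var) pair.  Returns the
    start frame of every phone in the sequence.
    """
    T, N = len(frames), len(phone_seq)
    loglik = np.stack([log_gaussian(frames, *models[p]) for p in phone_seq])  # N x T
    score = np.full((N, T), -np.inf)
    back = np.zeros((N, T), dtype=int)  # 0 = stayed in phone, 1 = entered from previous phone
    score[0, 0] = loglik[0, 0]
    for t in range(1, T):
        for n in range(N):
            stay = score[n, t - 1]
            enter = score[n - 1, t - 1] if n > 0 else -np.inf
            if enter > stay:
                score[n, t], back[n, t] = enter + loglik[n, t], 1
            else:
                score[n, t], back[n, t] = stay + loglik[n, t], 0
    # Trace back to recover the frame at which each phone starts.
    starts, n = [0] * N, N - 1
    for t in range(T - 1, 0, -1):
        if back[n, t]:
            starts[n] = t
            n -= 1
    return starts
```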

Automatic Segmentation for Czech Concatenative Speech Synthesis Using Statistical Approach with Boundary-Specific Correction

2003

This paper deals with the problems of automatic segmentation for the purposes of Czech concatenative speech synthesis. A statistical approach to speech segmentation using hidden Markov models (HMMs) is applied in the baseline system. Several improvements of this system are then proposed to obtain more accurate segmentation results. These enhancements mainly concern various strategies of HMM initialization (flat-start initialization, bootstrapping with hand-labeled or speaker-independent HMMs). Since HTK, the hidden Markov model toolkit, was utilized in our work, a correction of the output boundary placements is proposed to reflect the speech parameterization mechanism. An objective comparison of various automatic methods and manual segmentation is performed to find the best method. The best results were obtained for boundary-specific statistical correction of the segmentation that resulted from bootstrapping with hand-labeled HMMs (96% segmentation accuracy within a 20 ms tolerance region).
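
The boundary-specific correction can be pictured as estimating, from a small hand-labeled set, the systematic shift made for each boundary type and subtracting it from the automatic boundary times. The sketch below illustrates that idea under simplified assumptions (boundary types keyed by phone-class pairs, times in milliseconds); it is not the authors' HTK-specific procedure.

```python
from collections import defaultdict

def estimate_shifts(auto, manual):
    """auto/manual: lists of (boundary_type, time_ms) for the same boundaries."""
    errors = defaultdict(list)
    for (btype, t_auto), (_, t_manual) in zip(auto, manual):
        errors[btype].append(t_auto - t_manual)
    return {btype: sum(e) / len(e) for btype, e in errors.items()}

def correct(boundaries, shifts):
    """Remove the systematic per-type shift from automatic boundaries."""
    return [(btype, t - shifts.get(btype, 0.0)) for btype, t in boundaries]

# Toy example: vowel-to-nasal boundaries placed 12 ms too late on average
shifts = estimate_shifts(
    auto=[(("vowel", "nasal"), 512.0), (("vowel", "nasal"), 1034.0)],
    manual=[(("vowel", "nasal"), 500.0), (("vowel", "nasal"), 1022.0)],
)
print(correct([(("vowel", "nasal"), 2010.0)], shifts))  # -> [(..., 1998.0)]
```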

Phonetic alignment: speech synthesis based vs. hybrid HMM/ANN

1998

In this paper we compare two different methods for phonetically labeling a speech database. The first approach is based on aligning the speech signal with a high-quality synthetic speech pattern, and the second uses a hybrid HMM/ANN system. Both systems have been evaluated on manually segmented French read utterances from a speaker never seen in the training stage of the HMM/ANN system. This study outlines the advantages and drawbacks of both methods. The synthesis-based system has the great advantage that no training stage is needed, while the classical HMM/ANN system easily allows multiple phonetic transcriptions. We derive a method for the automatic construction of phonetically labeled speech databases in which the synthesis-based segmentation tool bootstraps the training process of our hybrid HMM/ANN system. Such segmentation tools will be a key point for the development of improved speech synthesis and recognition systems.
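
A minimal way to picture the synthesis-based approach is to align the natural recording to a synthetic rendition of the same sentence by dynamic time warping and then project the synthetic phone boundaries, which are known by construction, onto the natural signal. The sketch below assumes MFCC-like frame matrices and is only an illustration of the principle, not the system evaluated in the paper.

```python
import numpy as np

def dtw_path(nat, syn):
    """Return the warping path between natural (T1 x D) and synthetic (T2 x D) frames."""
    T1, T2 = len(nat), len(syn)
    dist = np.linalg.norm(nat[:, None, :] - syn[None, :, :], axis=2)
    cost = np.full((T1 + 1, T2 + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, T1 + 1):
        for j in range(1, T2 + 1):
            cost[i, j] = dist[i - 1, j - 1] + min(
                cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1]
            )
    # Trace back from the end to recover the aligned frame pairs.
    path, i, j = [], T1, T2
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        i, j = (i - 1, j - 1) if step == 0 else ((i - 1, j) if step == 1 else (i, j - 1))
    return path[::-1]

def map_boundaries(path, syn_boundaries):
    """Project synthetic-frame boundaries onto natural frames via the DTW path."""
    syn_to_nat = {}
    for i, j in path:
        syn_to_nat.setdefault(j, i)  # first natural frame aligned to synthetic frame j
    return [syn_to_nat.get(b, 0) for b in syn_boundaries]
```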

Experiments with Automatic Segmentation for Czech Speech Synthesis

Lecture Notes in Computer Science, 2003

This paper deals with automatic segmentation for Czech concatenative speech synthesis. A statistical approach to speech segmentation using hidden Markov models (HMMs) is applied in the baseline system [1]. Several experiments concerning various issues in the process of building the segmentation system, such as speech parameterization and HMM initialization, are described here. An objective comparison of various experimental automatic segmentations and manual segmentation is performed to find the best settings of the segmentation system with respect to our single-female-speaker continuous speech corpus.
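
One of the initialization strategies studied in this line of work, flat-start initialization, can be sketched as estimating each phone model from a uniform segmentation of the training utterances, so that no hand labels are needed before training begins. The snippet below is a simplified illustration with single-Gaussian phone models; a full system would refine these estimates with further HMM training.

```python
import numpy as np
from collections import defaultdict

def flat_start(utterances):
    """utterances: list of (frames T x D, phone_seq).  Returns phone -> (mean, var)."""
    pooled = defaultdict(list)
    for frames, phones in utterances:
        # Split the utterance into equal-length chunks, one per phone in the transcription.
        for phone, chunk in zip(phones, np.array_split(frames, len(phones))):
            pooled[phone].append(chunk)
    return {
        p: (np.concatenate(c).mean(axis=0), np.concatenate(c).var(axis=0) + 1e-6)
        for p, c in pooled.items()
    }
```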

German and Czech speech synthesis using HMM-based speech segment database

2002

This paper presents an experimental German speech synthesis system. As in the case of the Czech text-to-speech system ARTIC, a statistical approach (using hidden Markov models) was employed to build a speech segment database. This approach was confirmed to be language independent and capable of producing a quality database that leads to intelligible, high-quality synthetic speech. Some experiments with clustering similar speech contexts were performed to enhance the quality of the synthetic speech. Our results show the superiority of phoneme-level clustering over subphoneme-level clustering.
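
The abstract does not specify the clustering algorithm, but the general idea of sharing one model among similar speech contexts can be pictured with a simple greedy scheme: context-dependent variants of the same base phone whose acoustic summaries are close enough are merged into one cluster. The sketch below is a hypothetical illustration, not the paper's method; the threshold and the mean-vector representation are assumptions.

```python
import numpy as np

def cluster_contexts(units, threshold=1.0):
    """units: dict mapping context label -> mean feature vector (same base phone).
    Returns a dict mapping each context to the label of its cluster representative."""
    reps, assignment = [], {}
    for label, vec in units.items():
        for rep_label, rep_vec in reps:
            if np.linalg.norm(vec - rep_vec) < threshold:
                assignment[label] = rep_label  # share the representative's model
                break
        else:
            reps.append((label, vec))          # start a new cluster
            assignment[label] = label
    return assignment
```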

Automatic generation of speech synthesis units based on closed loop training

IEEE International Conference on Acoustics, Speech, and Signal Processing, 1997

This paper proposes a new method for automatically generating speech synthesis units. A small set of synthesis units is selected from a large speech database by the proposed Closed-Loop Training method (CLT). Because CLT is based on the evaluation and minimization of the distortion caused by the synthesis process, such as prosodic modification, the selected synthesis units are the most suitable for synthesizers. In this paper, CLT is applied to a waveform-concatenation-based synthesizer whose basic unit is the CV/VC (diphone). It is shown that synthesis units can be efficiently generated by CLT from a labeled speech database with a small amount of computation. Moreover, the synthesized speech is clear and smooth even though the storage size of the waveform dictionary is small.
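
The closed-loop principle can be illustrated as follows: for each unit type, every candidate instance is passed through a stand-in for the synthesis-time prosodic modification, and the instance whose modified version stays closest to all natural instances of that type is kept. The `modify_prosody` function and the distance below are deliberately simplistic assumptions, not the paper's actual modification or distortion measure.

```python
import numpy as np

def modify_prosody(unit, factor=1.0):
    # Stand-in for prosodic modification: crude time-scaling by index resampling.
    idx = np.clip((np.arange(len(unit)) * factor).astype(int), 0, len(unit) - 1)
    return unit[idx]

def distortion(candidate, target):
    # Frame-averaged spectral distance after truncating both to the same length.
    n = min(len(candidate), len(target))
    return float(np.mean(np.linalg.norm(candidate[:n] - target[:n], axis=1)))

def closed_loop_select(instances):
    """instances: list of feature matrices (frames x dims) for one diphone type.
    Returns the index of the instance with the lowest total synthesis distortion."""
    costs = []
    for cand in instances:
        synthesized = modify_prosody(cand, factor=1.2)
        costs.append(sum(distortion(synthesized, tgt) for tgt in instances))
    return int(np.argmin(costs))
```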

Improving consistence of phonetic transcription for text-to-speech

2009

Grapheme-to-phoneme conversion is an important step in speech segmentation and synthesis. Many approaches have been proposed in the literature to perform appropriate transcriptions: CART, FST, HMM, etc. In this paper we propose an automatic algorithm that uses transformation-based error-driven learning to match the phonetic transcription to the speaker's dialect and style. Different transcriptions based on words, part-of-speech tags, weak forms and phonotactic rules are validated. The experimental results show an improvement in the transcription according to an objective measure. The articulation MOS score is also improved, as most of the changes in phonetic transcription affect coarticulation effects.
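
Transformation-based error-driven learning can be sketched as a greedy loop that repeatedly picks the single rewrite rule removing the most remaining errors between a predicted and a reference phone string. The snippet below uses toy rules conditioned only on the left neighbour; the rule templates described in the paper (words, POS tags, weak forms, phonotactics) are richer, so this is an illustration of the learning scheme only.

```python
from collections import Counter

def apply_rule(phones, rule):
    """Rewrite phone `src` to `dst` whenever the left neighbour is `left`."""
    src, dst, left = rule
    out = list(phones)
    for i in range(1, len(out)):
        if out[i] == src and out[i - 1] == left:
            out[i] = dst
    return out

def errors(pred, ref):
    return sum(p != r for p, r in zip(pred, ref))

def learn_rules(pred, ref, max_rules=10):
    """Greedy TBL loop: repeatedly keep the rule that fixes the most errors."""
    rules = []
    for _ in range(max_rules):
        # Propose one candidate rule per remaining error position.
        candidates = Counter()
        for i in range(1, len(pred)):
            if pred[i] != ref[i]:
                candidates[(pred[i], ref[i], pred[i - 1])] += 1
        if not candidates:
            break
        base, best = errors(pred, ref), None
        for rule in candidates:
            gain = base - errors(apply_rule(pred, rule), ref)
            if best is None or gain > best[0]:
                best = (gain, rule)
        if best[0] <= 0:
            break
        rules.append(best[1])
        pred = apply_rule(pred, best[1])
    return rules
```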