Proc. 5th ISCA speech synthesis workshop

Articulatory Speech Synthesizer

2000

The aim of this research was to gain more understanding about the unvoiced speech production mechanisms, and develop one solution to the excitation relocation problem for the vocally handicapped. A flexible and high quality articulatory speech synthesis tool called ARTM was developed to achieve this aim. The results of this study will be of interest to researchers in speech modeling, analysis, and synthesis. Speech is generated when air is expelled from the lungs through the larynx, passed into the vocal cavities and finally, radiated at the mouth and nostrils. It can be roughly classified as either voiced sounds, fricatives or plosives. Since most portions of speech are voiced sounds, analyzing voiced sounds provides us with an understanding of the production of phonetic information and vocal characteristics. Numerous algorithms have been proposed and many of them give successful results for voiced speech. However, little effort has been devoted to the analysis of unvoiced speech. ...

Parameterization of vocal fry in HMM-based speech synthesis

2009

HMM-based speech synthesis offers a way to generate speech with different voice qualities. However, databases sometimes contain inherent voice qualities that need to be parameterized properly. One example is vocal fry, which typically occurs at the end of utterances. A popular mixed excitation vocoder for HMM-based speech synthesis is STRAIGHT. The standard STRAIGHT is optimized for modal voices and may not produce high-quality speech with other voice types. Fortunately, due to the flexibility of STRAIGHT, different F0 and aperiodicity measures can be used in the synthesis without any inherent degradation in speech quality. We have replaced the STRAIGHT excitation with a representation based on a robust F0 measure and a carefully determined two-band voicing. According to our analysis-synthesis experiments, the new parameterization can improve the speech quality. In HMM-based speech synthesis, the quality is significantly improved, especially due to the better modeling of vocal fry. Index Terms: speech synthesis, hidden Markov models, vocal fry, mixed excitation, STRAIGHT

Improved method for model parameters extraction used in high-quality speech synthesis

This paper addresses the representation of the speech signal using the harmonic plus noise model (HNM) configured for a high-quality speech synthesis application. The features of the model explored in this paper mainly relate to newly introduced methods for pitch period extraction and maximum voiced frequency detection. Based on subjective listening tests, a comparison between the classic HNM and the version containing our improved methods clearly shows (in terms of comparative mean opinion score, CMOS) the superiority of the proposed solutions.
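The harmonic plus noise decomposition named in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the 1/k harmonic amplitudes, and the use of random-phase sinusoids as a stand-in for the noise band are all assumptions made for illustration. The only idea taken from the abstract is that harmonics of the pitch are kept up to a maximum voiced frequency, with a noise component above it.

```python
import math
import random


def harmonic_count(f0_hz, max_voiced_freq_hz):
    """Number of harmonics of f0 at or below the maximum voiced frequency."""
    return int(max_voiced_freq_hz // f0_hz)


def synth_hnm_frame(f0_hz, max_voiced_freq_hz, sample_rate_hz, n_samples,
                    noise_amp=0.1, noise_step_hz=100.0, seed=0):
    """Synthesize one frame as harmonic part + noise part (illustrative only).

    Assumptions (not from the paper): harmonic k has amplitude 1/k, and the
    noise band is approximated by random-phase sinusoids spaced
    noise_step_hz apart between the maximum voiced frequency and Nyquist.
    """
    rng = random.Random(seed)
    n_harm = harmonic_count(f0_hz, max_voiced_freq_hz)
    nyquist_hz = sample_rate_hz / 2.0

    # Frequencies of the noise components, strictly above the voiced band.
    noise_freqs = []
    freq = max_voiced_freq_hz + noise_step_hz
    while freq < nyquist_hz:
        noise_freqs.append(freq)
        freq += noise_step_hz
    noise_phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in noise_freqs]

    frame = []
    for n in range(n_samples):
        t = n / sample_rate_hz
        harmonic = sum(math.cos(2.0 * math.pi * k * f0_hz * t) / k
                       for k in range(1, n_harm + 1))
        noise = sum(noise_amp * math.cos(2.0 * math.pi * fq * t + ph)
                    for fq, ph in zip(noise_freqs, noise_phases))
        frame.append(harmonic + noise)
    return frame


if __name__ == "__main__":
    frame = synth_hnm_frame(f0_hz=120.0, max_voiced_freq_hz=4000.0,
                            sample_rate_hz=16000.0, n_samples=160)
    print(len(frame), harmonic_count(120.0, 4000.0))
```

In this sketch, a better pitch estimate changes where the harmonics are placed, and a better maximum voiced frequency estimate changes how many of them are kept, which are the two quantities the abstract says the improved methods target.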

Physiologically motivated modelling of the voice source in articulatory analysis/synthesis

Speech Communication, 1996

This paper describes the implementation of a new parametric model of the glottal geometry aimed at improving male and female speech synthesis in the framework of articulatory analysis/synthesis. The model represents glottal geometry in terms of inlet and outlet area waveforms and is controlled by parameters that are tightly coupled to physiology, such as vocal fold abduction. It is embedded in an articulatory analysis/synthesis system (articulatory speech mimic). To introduce naturally occurring details in our synthetic glottal flow waveforms, we modelled two different kinds of leakage: a "linked leak" and a "parallel chink". While the first is basically an incomplete glottal closure, the latter models a second glottal duct that is independent of the membranous (vibrating) part of the glottis. Characteristic of both types of leak is that they increase dc-flow and source/tract interaction. A linked leak, however, gives rise to a steeper roll-off of the entire glottal flow spectrum, whereas a parallel chink decreases the energy of the lower frequencies more than that of the higher frequencies. In fact, for a parallel chink, the slope at the higher frequencies is more or less the same as in the no-leakage case.

Factors Influencing Vocal Pitch In Articulatory Speech Synthesis: A Study Using Praat

Proceedings of the SMC Conferences, 2016

An extensive study of the parameters influencing the pitch of a standard speaker in articulatory speech synthesis is presented. The speech synthesiser used is the articulatory synthesiser in PRAAT. Specifically, the effect of two parameters, Lungs and Cricothyroid, on the average pitch of the synthesised sounds is studied. Statistical analysis of the synthesis data shows the extent to which each of the variables changes the tonality of the speech signals.

Analysis and synthesis of hypo and hyperarticulated speech

Proc. Speech Synthesis Workshop, 2010

This paper focuses on the analysis and synthesis of hypo and hyperarticulated speech in the framework of HMM-based speech synthesis. First of all, a new French database matching our needs was created, which contains three identical sets, pronounced with three different degrees of articulation: neutral, hypo and hyperarticulated speech. On that basis, acoustic and phonetic analyses were performed. It is shown that the degrees of articulation significantly influence, on one hand, both vocal tract and glottal characteristics, ...

Analysis and HMM-based synthesis of hypo and hyperarticulated speech

Computer Speech & Language, 2014

Hypo and hyperarticulation refer to the production of speech with, respectively, a reduction and an increase of the articulatory efforts compared to the neutral style. Produced consciously or not, these variations of articulatory effort depend upon the surrounding environment, the communication context and the motivation of the speaker with regard to the listener. The goal of this work is to integrate hypo and hyperarticulation into speech synthesizers, so that they are more realistic by automatically adapting their way of speaking to the contextual situation, as humans do. Based on our preliminary work, this paper provides a thorough and detailed study of the analysis and synthesis of hypo and hyperarticulated speech. It is divided into three parts. In the first, we focus on both acoustic and phonetic modifications due to articulatory effort changes. The second part aims at developing an HMM-based speech synthesizer allowing continuous control of the degree of articulation. This requires first tackling the issue of speaking style adaptation to derive hypo and hyperarticulated speech from the neutral synthesizer. Once this is done, interpolation and extrapolation of the resulting models make it possible to finely tune the voice so that it is generated with the desired articulatory effort. Finally, the third part focuses on a perceptual study of speech with a variable articulation degree, analyzing how intelligibility and various other voice dimensions are affected.
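The continuous control of articulation degree described in this abstract can be pictured with a small sketch. It assumes, purely for illustration, that each model is summarized by a vector of Gaussian means and that blending is linear; the abstract does not state how the interpolation and extrapolation are actually carried out, and the name `blend_means` is hypothetical.

```python
def blend_means(neutral, adapted, alpha):
    """Linearly blend two mean vectors (illustrative assumption).

    alpha = 0 returns the neutral model, alpha = 1 the adapted model
    (for example, hyperarticulated). Values between 0 and 1 interpolate;
    values above 1 extrapolate beyond the adapted style.
    """
    if len(neutral) != len(adapted):
        raise ValueError("mean vectors must have the same length")
    return [n + alpha * (a - n) for n, a in zip(neutral, adapted)]


if __name__ == "__main__":
    neutral = [1.0, 2.0]
    hyper = [3.0, 6.0]
    print(blend_means(neutral, hyper, 0.5))  # halfway between the styles
    print(blend_means(neutral, hyper, 1.5))  # exaggerated beyond "hyper"
```

Under these assumptions a single scalar controls the articulation degree, which matches the abstract's description of a voice that can be finely tuned between and beyond the adapted styles.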

Modeling different voice qualities for female and male talkers using a geometric-kinematic articulatory voice source model: preliminary results

Modeling natural-sounding voice qualities (for example, the pressed-modal-breathy voice quality continuum, which occurs widely during normal speech production) is a crucial point in speech synthesis. A parametric voice source model using prescribed sinusoidal vocal fold vibration patterns (i.e. an extended Titze model) is introduced in this paper. This voice source model was adapted for synthesis of a typical male and female voice. A simulation experiment was performed by varying glottal abduction/adduction in order to generate a voice quality continuum from pressed over modal towards breathy. A parameter analysis of the resulting waveshapes of glottal flow and its time derivative was carried out in terms of the LF-model. This analysis indicates that our parametric voice source model is flexible enough to generate the modal to breathy but not the modal to pressed voice quality continuum for the male as well as for the female voice. It can be hypothesized that a self-oscillating voice source model is needed in order to generate the whole spectrum of vocal fold vibration patterns occurring during normal speech production.

Footnote 1: This paper is dedicated to the 60th birthday of Bernd Pompino-Marschall. I met Bernd Pompino-Marschall in the early 1990s as a young scientist, when Bernd was already Professor in Berlin. I thank him especially for supporting me in my hard years between my PhD and getting my permanent position at RWTH Aachen University. In these years I was always welcome to visit his lab at ZAS and at Humboldt University of Berlin, and I was employed in Berlin from 1999 to 2000. Great times! I learned a lot! Thank you for everything, Bernd!

Parametric model for vocal effort interpolation with harmonics plus noise models

Speech Synthesis Workshop, 2013

It is known that voice quality plays an important role in expressive speech. In this paper, we present a methodology for modifying vocal effort level, which can be applied by text-to-speech (TTS) systems to provide the flexibility needed to improve the naturalness of synthesized speech. This extends previous work using low-order Linear Prediction Coefficients (LPC), where the flexibility was constrained by the number of vocal effort levels available in the corpora. The proposed methodology overcomes these limitations by replacing the low-order LPC with ninth-order polynomials, allowing vocal effort not only to be modified towards the available templates, but also to be generated at intermediate levels between those available in the training data. This flexibility comes from combining Harmonics plus Noise Models with a parametric model of the spectral envelope. The perceptual tests conducted demonstrate the effectiveness of the proposed technique in performing vocal effort interpolations while maintaining the signal quality in the final synthesis. The proposed technique can be used in unit-selection TTS systems to reduce corpus size while increasing flexibility, and it could potentially be employed by HMM-based speech synthesis systems if appropriate acoustic features are used.
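As a rough illustration of how a polynomial envelope model permits intermediate vocal effort levels, the sketch below interpolates polynomial coefficients between two effort levels and evaluates the resulting envelope. The linear interpolation of coefficients, the normalized frequency axis, and the function names are assumptions for illustration; the abstract states only that ninth-order polynomials represent the spectral envelope.

```python
def eval_envelope(coeffs, x):
    """Evaluate a polynomial envelope at x using Horner's rule.

    coeffs are ordered from the constant term upward; x is assumed to be
    a normalized frequency (an illustrative convention).
    """
    result = 0.0
    for c in reversed(coeffs):
        result = result * x + c
    return result


def interpolate_coeffs(low_effort, high_effort, weight):
    """Blend two coefficient sets; weight 0 gives low, weight 1 gives high."""
    if len(low_effort) != len(high_effort):
        raise ValueError("coefficient sets must have the same order")
    return [(1.0 - weight) * a + weight * b
            for a, b in zip(low_effort, high_effort)]


if __name__ == "__main__":
    # Ten coefficients each, i.e. ninth-order polynomials (made-up values).
    soft = [0.0, -1.0, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.1]
    loud = [2.0, -0.5, 0.2, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.3]
    mid = interpolate_coeffs(soft, loud, 0.5)
    print(eval_envelope(mid, 0.25))
```

Because a polynomial is linear in its coefficients, the envelope obtained from blended coefficients equals the same blend of the two original envelopes at every frequency, so under this assumption intermediate effort levels vary smoothly between the two templates.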