A perceptually and physiologically motivated voice source model.

Perceptual evaluation of voice source models

Models of the voice source differ in their fits to natural voices, but it is unclear which differences in fit are perceptually salient. This study examined the relationship between the fit of five voice source models to 40 natural voices, and the degree of perceptual match among stimuli synthesized with each of the modeled sources. Listeners completed a visual sort-and-rate task to compare versions of each voice created with the different source models, and the results were analyzed using multidimensional scaling. Neither fits to pulse shapes nor fits to landmark points on the pulses predicted observed differences in quality. Further, the source models fit the opening phase of the glottal pulses better than they fit the closing phase, but at the same time similarity in quality was better predicted by the timing and amplitude of the negative peak of the flow derivative (part of the closing phase) than by the timing and/or amplitude of peak glottal opening. Results indicate that simply knowing how (or how well) a particular source model fits or does not fit a target source pulse in the time domain provides little insight into what aspects of the voice source are important to listeners.
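The abstract above says the sort-and-rate data were analyzed with multidimensional scaling but does not say which variant was used. As a rough illustration only, the sketch below implements classical (Torgerson) scaling in plain Python: it double-centres the squared dissimilarities and extracts the leading eigenvectors by power iteration. The function name `classical_mds` and the toy rectangle data are assumptions made for this example, not material from the study.

```python
import math
import random


def classical_mds(dist, dims=2, iters=500, seed=0):
    """Embed items in `dims` dimensions from a symmetric dissimilarity matrix."""
    n = len(dist)
    sq = [[dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row = [sum(r) / n for r in sq]
    total = sum(row) / n
    # B = -1/2 * J * D^2 * J, where J is the centring matrix.
    b = [[-0.5 * (sq[i][j] - row[i] - row[j] + total) for j in range(n)]
         for i in range(n)]

    def matvec(v):
        return [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]

    rng = random.Random(seed)
    coords = [[0.0] * dims for _ in range(n)]
    for d in range(dims):
        # Power iteration for the current dominant eigenvector of B.
        v = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        for _ in range(iters):
            w = matvec(v)
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        norm_v = math.sqrt(sum(x * x for x in v))
        v = [x / norm_v for x in v]
        lam = sum(x * y for x, y in zip(v, matvec(v)))  # Rayleigh quotient
        if lam <= 0.0:
            break  # no further positive eigenvalue to embed
        for i in range(n):
            coords[i][d] = math.sqrt(lam) * v[i]
        # Deflate so the next pass finds the next eigenvector.
        for i in range(n):
            for j in range(n):
                b[i][j] -= lam * v[i] * v[j]
    return coords


# Toy check: distances between the corners of a 4 x 3 rectangle.
corners = [(0.0, 0.0), (4.0, 0.0), (4.0, 3.0), (0.0, 3.0)]
toy_dist = [[math.dist(p, q) for q in corners] for p in corners]
layout = classical_mds(toy_dist)
```

Perceptual dissimilarities are rarely exactly Euclidean, so in practice a non-metric MDS routine from a statistics package would be a more likely choice. The sketch is only meant to show what "scaling" a dissimilarity matrix into a low-dimensional map involves.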

Source model adequacy for pathological voice synthesis

Pathological voices are particularly difficult to inverse filter and fit with source models, in part because of the source-tract interactions inherent in these voice types. To obtain good synthetic copies of these voices, we need to know the practical importance of accurately inverse filtering and modeling individual voice sources. To this end, thirty different voices were synthesized, each with four different sources. Preliminary results suggest that the output of the inverse filter does not always contain enough information about the source to adequately reconstruct vocal quality. However, the LF model of the voice source pulse does provide enough degrees of freedom to model naturally occurring quality variations across voices.

Physiologically motivated modelling of the voice source in articulatory analysis/synthesis

Speech Communication, 1996

This paper describes the implementation of a new parametric model of the glottal geometry aimed at improving male and female speech synthesis in the framework of articulatory analysis/synthesis. The model represents glottal geometry in terms of inlet and outlet area waveforms and is controlled by parameters that are tightly coupled to physiology, such as vocal fold abduction. It is embedded in an articulatory analysis/synthesis system (articulatory speech mimic). To introduce naturally occurring details into our synthetic glottal flow waveforms, we modelled two different kinds of leakage: a "linked leak" and a "parallel chink". While the first is basically an incomplete glottal closure, the latter models a second glottal duct that is independent of the membranous (vibrating) part of the glottis. Characteristic of both types of leak is that they increase DC flow and source/tract interaction. A linked leak, however, gives rise to a steeper roll-off of the entire glottal flow spectrum, whereas a parallel chink decreases the energy of the lower frequencies more than that of the higher frequencies. In fact, for a parallel chink, the slope at the higher frequencies is more or less the same as in the no-leakage case.

Transformation of voice quality in singing using glottal source features

2019

Glottal activity information can be very important in several speech processing applications, such as speech therapy, voice disorder diagnosis, voice transformation and text-to-speech synthesis. However, the use of algorithms for estimating glottal parameters from the speech signal is very limited in those applications because of problems with robustness and accuracy. In singing synthesis, the glottal source representation is also very important because it is closely related to emotion and singing style. This paper proposes a robust method to estimate the voice quality parameters of the glottal source by using both the electroglottographic signal and acoustic recordings of the singing voice for five vowels in three different voice qualities: modal, breathy and creaky. The analysis of the resulting measurements confirmed that the voice quality parameters of the glottal source are correlated with the type of voice. Moreover, another experiment was conducted to show that it is possible to transform the modal singing voice into breathy and creaky voice by using an analysis-synthesis method that incorporates a glottal source model.

Glottal source processing: From analysis to applications

Computer Speech & Language, 2014

The great majority of current voice technology applications rely on acoustic features characterizing the vocal tract response, such as the widely used MFCC or LPC parameters. Nonetheless, the airflow passing through the vocal folds, called the glottal flow, is expected to provide relevant complementary information. Unfortunately, glottal analysis from speech recordings requires specific and more complex processing operations, which explains why it has generally been avoided. This review gives a general overview of techniques that have been designed for glottal source processing. Starting from the fundamental analysis tools of pitch tracking, glottal closure instant detection, and glottal flow estimation and modelling, this paper then highlights how these solutions can be properly integrated within various voice technology applications.
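To make "glottal flow estimation" slightly more concrete, the following is a minimal sketch of plain LPC inverse filtering: fit an all-pole model with the autocorrelation method (Levinson-Durbin), then pass the signal through the resulting prediction-error filter. This is a simplified stand-in and not a reproduction of any specific method from the review. The toy signal, the coefficient values in `true_a`, and the use of white noise instead of glottal pulses as excitation are all assumptions made so that the example is self-checking.

```python
import random


def autocorrelation(x, max_lag):
    """Raw autocorrelation sums r[0..max_lag] of the sequence x."""
    n = len(x)
    return [sum(x[i] * x[i + k] for i in range(n - k))
            for k in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the LPC normal equations; returns [1, a1, ..., ap] for A(z)."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for m in range(1, order + 1):
        acc = r[m] + sum(a[k] * r[m - k] for k in range(1, m))
        k_m = -acc / err                      # reflection coefficient
        prev = a[:]
        for k in range(1, m):
            a[k] = prev[k] + k_m * prev[m - k]
        a[m] = k_m
        err *= 1.0 - k_m * k_m
    return a


def inverse_filter(x, a):
    """Apply the FIR filter A(z) = 1 + a1 z^-1 + ... + ap z^-p to x."""
    p = len(a) - 1
    return [sum(a[k] * x[n - k] for k in range(min(p, n) + 1))
            for n in range(len(x))]


# Toy signal (assumption): white-noise excitation through a known
# two-pole filter, x[n] = e[n] - a1 x[n-1] - a2 x[n-2].
rng = random.Random(0)
true_a = [1.0, -1.2, 0.8]
excitation = [rng.gauss(0.0, 1.0) for _ in range(5000)]
signal = []
for n, e in enumerate(excitation):
    y = e
    if n >= 1:
        y -= true_a[1] * signal[n - 1]
    if n >= 2:
        y -= true_a[2] * signal[n - 2]
    signal.append(y)

a_hat = levinson_durbin(autocorrelation(signal, 2), 2)
residual = inverse_filter(signal, a_hat)      # estimate of the excitation
```

On real speech the residual is only a first approximation of the glottal flow derivative. Practical systems typically add pre-emphasis, windowing, a model order matched to the sampling rate, and a restriction of the analysis to particular parts of the glottal cycle.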

Synthesis of the voice source using a physically-informed model of the glottis

2001

A physically-informed glottal model is proposed; some physical information is retained in a linear block that accounts for fold mechanics, while non-linear coupling with the airflow is modeled using a regressor-based mapping. The model is used in an identification/resynthesis scheme. Given a real signal, system parameters are estimated via non-linear identification techniques; then the model is used for resynthesizing the signal. With a proper choice of the regressor set, the system accurately fits the target waveform and is stable during resynthesis. Physical parameters can be used to change voice quality and speaker identity.

Inter- and Intra-speaker Variability of Glottal Flow Derivative using the LF Model

The vowels /a, i, u/ spoken by American English talkers with non-pathological voices are described by means of voice source model parameters using the Liljencrants-Fant (LF) model. The sampling frequency of the data is 8 kHz, which approximately matches telephone bandwidth. After inverse filtering, trends of voice source characteristics depending on the LF parameters are analyzed and compared to the literature and to listening results.

Keywords: voice source, LF model, LF parameters.

1. INTRODUCTION

Non-pathological voice source characteristics have been studied by inverse filtering the speech waveform [11], analyzing the speech spectra [6], or by measuring the airflow at the mouth [10]. Knowing the voice source parameters can be beneficial for many speech processing applications, such as speaker identification [8] and speech synthesis. In [6], individual and gender variations in source parameters have been analyzed using measures from speech spectra and taking into account the influence o...

Glottal Source Model Selection for Stationary Singing-Voice by Low-Band Envelope Matching

Proc. of NOLISP workshop 2013, 2013

In this paper a preliminary study on voice excitation modeling by single glottal shape parameter selection is presented. A strategy for direct model selection by matching derivative glottal source estimates with LF-based candidates driven by the Rd parameter is explored by means of two state-of-the-art similarity measures and a novel one considering spectral envelope information. An experimental study on synthetic singing-voice was carried out to compare the performance of the different measures and to observe potential relations with respect to different voice characteristics (e.g. vocal effort, pitch range, amount of aperiodicities and aspiration noise). The results of this study allow us to claim competitive performance of the proposed strategy and suggest preferable source modeling conditions for stationary singing-voice.
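For readers unfamiliar with how a single Rd value can drive a family of LF candidates, here is a small sketch of the mapping as it is commonly quoted from Fant's 1995 revision of the LF model, where Ra, Rk and Rg are predicted from Rd by regression formulas. The abstract does not state which mapping the authors used, so the formulas, the function names, and the candidate grid below are assumptions made for illustration.

```python
def rd_to_lf_ratios(rd):
    """Predict the LF shape ratios (Ra, Rk, Rg) from the single parameter Rd."""
    ra = (-1.0 + 4.8 * rd) / 100.0
    rk = (22.4 + 11.8 * rd) / 100.0
    rg = rk / (4.0 * (0.11 * rd / (0.5 + 1.2 * rk) - ra))
    return ra, rk, rg


def lf_timing(rd, t0):
    """Convert the ratios to LF timing values for a period of t0 seconds.

    Uses the usual definitions Ra = ta/T0, Rk = (te - tp)/tp, Rg = T0/(2 tp).
    """
    ra, rk, rg = rd_to_lf_ratios(rd)
    tp = t0 / (2.0 * rg)       # instant of peak glottal flow
    te = tp * (1.0 + rk)       # instant of main excitation
    ta = ra * t0               # effective duration of the return phase
    return tp, te, ta


# Hypothetical candidate grid over a commonly used Rd range (0.3 to 2.7).
candidates = [round(0.3 + 0.1 * i, 1) for i in range(25)]
table = {rd: rd_to_lf_ratios(rd) for rd in candidates}
tp, te, ta = lf_timing(1.0, 1.0 / 220.0)   # one period at 220 Hz
```

A selection scheme of the kind described would then synthesize one LF pulse per candidate and keep the Rd whose pulse is closest to the estimated source under some similarity measure. That matching step is not sketched here because the abstract does not specify the measures.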

A comparative study of glottal source estimation techniques

Computer Speech & Language, 2012

Source-tract decomposition (or glottal flow estimation) is one of the basic problems of speech processing. For this, several techniques have been proposed in the literature. However, studies comparing different approaches are almost nonexistent. Besides, experiments have been systematically performed either on synthetic speech or on sustained vowels. In this study we compare three of the main representative state-of-the-art methods of glottal flow estimation: closed-phase inverse filtering, iterative and adaptive inverse ...