Carlo Drioli | University of Macerata

Papers by Carlo Drioli

Fitting a biomechanical model of the folds to high-speed video data through Bayesian estimation

Informatics in Medicine Unlocked, 2020

This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of record. This version will undergo additional copyediting, typesetting and review before it is published in its final form, but we are providing this version to give early visibility of the article. Please note that, during the production process, errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

Accurate glottal model parametrization by integrating audio and high-speed endoscopic video data

The aim of this paper is to evaluate the effectiveness of using video data for voice source parametrization in the representation of voice production through physical modeling. Laryngeal imaging techniques can be effectively used to obtain vocal fold video sequences and to derive time patterns of relevant glottal cues, such as folds edge position or glottal area. In many physically based numerical models of the vocal folds, these parameters are estimated from the inverse filtered glottal flow waveform, obtained from audio recordings of the sound pressure at lips. However, this model inversion process is often problematic and affected by accuracy and robustness issues. It is here discussed how video analysis of the fold vibration might be effectively coupled to the parametric estimation algorithms based on voice recordings, to improve accuracy and robustness of model inversion.
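
As a rough, hedged illustration of the video-driven parametrization idea, the sketch below fits a toy clipped-sinusoid glottal-area model to a synthetic area waveform of the kind that could be extracted from high-speed video; the waveform, the model form, and the camera rate are all assumptions, not the paper's actual pipeline.

```python
# Hypothetical sketch: fit a simple glottal-area model to an area waveform
# extracted from high-speed video, instead of relying on inverse-filtered
# audio alone.
import numpy as np
from scipy.optimize import least_squares

fs_video = 4000.0                      # assumed high-speed camera rate (Hz)
t = np.arange(400) / fs_video

# Stand-in for a video-derived glottal area sequence (arbitrary units).
true_area = np.maximum(0.0, 8.0 * np.sin(2 * np.pi * 120.0 * t)) + 0.5
observed = true_area + 0.3 * np.random.default_rng(0).standard_normal(t.size)

def area_model(params, t):
    """Clipped-sinusoid glottal area: amplitude, f0, baseline (DC) area."""
    amp, f0, dc = params
    return np.maximum(0.0, amp * np.sin(2 * np.pi * f0 * t)) + dc

def residual(params):
    return area_model(params, t) - observed

fit = least_squares(residual, x0=[5.0, 100.0, 0.1])
print("estimated amplitude, f0, baseline:", fit.x)
```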

A statistical signature for automatic dialogue classification

In the last few years, there has been growing attention to the problem of human-human communication, trying to devise artificial systems able to mediate a conversational setting between two or more people. In this paper, we design an automatic system based on a generative structure able to classify hard dialog acts. The generative model is built by integrating a hierarchical Gaussian mixture model and the Influence Model, yielding a new method able to deal with such difficult scenarios. The method has been tested on a set of conversational settings involving dialogues between adults and between children and adults, in flat and arguing discussions, achieving very accurate classification results.
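
A minimal sketch of the generative classification step, assuming synthetic features and standing in sklearn's GaussianMixture for the paper's hierarchical mixture; the Influence Model layer on top is omitted here.

```python
# Minimal sketch of generative dialog classification: fit one Gaussian
# mixture per dialog class and label a new dialog by which class model
# gives its feature frames the highest total log-likelihood.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
flat_train = rng.normal(0.0, 1.0, size=(500, 4))      # stand-in features
arguing_train = rng.normal(1.5, 1.2, size=(500, 4))

models = {
    "flat": GaussianMixture(n_components=3, random_state=0).fit(flat_train),
    "arguing": GaussianMixture(n_components=3, random_state=0).fit(arguing_train),
}

def classify(dialog_frames):
    # score_samples returns per-frame log-likelihoods; sum over the dialog.
    scores = {k: m.score_samples(dialog_frames).sum() for k, m in models.items()}
    return max(scores, key=scores.get)

test_dialog = rng.normal(1.4, 1.2, size=(80, 4))
print(classify(test_dialog))   # expected: "arguing"
```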

Auditory dialog analysis and understanding by generative modelling of interactional dynamics

In the last few years, interest in the analysis of human behavioral schemes has grown dramatically, in particular for the interpretation of the communication modalities called social signals. These represent well-defined interaction patterns, possibly unconscious, characterizing different conversational situations and behaviors in general. In this paper, we illustrate an automatic system based on a generative structure able to analyze conversational scenarios. The generative model is built by integrating a Gaussian mixture model and the (observed) influence model, and it is fed with a novel kind of simple low-level auditory social signals, termed steady conversational periods (SCPs). These are built on the duration of continuous slots of silence or speech, also taking conversational turn-taking into account. The interactional dynamics built upon the transitions among SCPs provide a behavioral blueprint of conversational settings without relying on segmental or continuous phonetic features. Our contribution here is to show the effectiveness of our model when applied to dialog classification and clustering tasks, considering dialogs between adults and between children and adults, in both flat and arguing discussions, and showing excellent performance also in comparison with state-of-the-art frameworks.
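
To make the SCP construction concrete, here is a hypothetical sketch: given frame-level speech/silence labels for two speakers (the representation and the frame step are assumptions, not the paper's), it collapses them into contiguous joint-state slots with durations.

```python
# Hypothetical sketch: collapse frame-level speech/silence labels of two
# speakers into steady conversational periods (joint state + duration).
from itertools import groupby

FRAME = 0.1  # assumed frame step in seconds

# Stand-in voice-activity labels per frame: 1 = speech, 0 = silence.
spk_a = [1, 1, 1, 0, 0, 0, 0, 1, 1, 0]
spk_b = [0, 0, 0, 0, 1, 1, 1, 1, 0, 0]

def scps(a, b):
    """Return (joint_state, duration_s) slots, e.g. ('A', 0.3)."""
    joint = []
    for va, vb in zip(a, b):
        if va and vb:
            joint.append("overlap")
        elif va:
            joint.append("A")
        elif vb:
            joint.append("B")
        else:
            joint.append("silence")
    return [(s, round(len(list(g)) * FRAME, 3)) for s, g in groupby(joint)]

print(scps(spk_a, spk_b))
# [('A', 0.3), ('silence', 0.1), ('B', 0.3), ('overlap', 0.1), ('A', 0.1), ('silence', 0.1)]
```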

PHYSIOLOGICAL CONTROL OF LOW-DIMENSIONAL GLOTTAL MODELS WITH APPLICATIONS TO VOICE SOURCE PARAMETER MATCHING

A set of rules is proposed for controlling a 2-mass glottal model through activation levels of laryngeal muscles. The rules convert muscle activities into physical quantities such as fold adduction, mass, thickness, depth, and stiffness. A codebook is constructed between muscular activations and a set of relevant voice source parameters, and its applications to voice source parameter matching are explored.
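
A hedged sketch of the codebook matching step: activation vectors are paired with the source parameters they produce, and a target parameter set is inverted by nearest-neighbour lookup. The activation names, the toy activation-to-parameter map, and the distance weights are all made up for illustration; in the paper the mapping would come from simulating the 2-mass model.

```python
# Hedged sketch of the codebook idea: pair muscle-activation vectors with
# the voice-source parameters the model produces, then invert by nearest-
# neighbour lookup to find activations matching target source parameters.
import numpy as np

rng = np.random.default_rng(2)

# Stand-in codebook inputs: rows of hypothetical (cricothyroid,
# thyroarytenoid) activation levels in [0, 1].
activations = rng.uniform(0.0, 1.0, size=(200, 2))

def simulate_source_params(act):
    # Placeholder for running the 2-mass model: a made-up smooth map from
    # activations to (f0 in Hz, open quotient).
    ct, ta = act.T
    f0 = 100.0 + 150.0 * ct - 30.0 * ta
    oq = 0.4 + 0.4 * ta
    return np.stack([f0, oq], axis=-1)

codebook = simulate_source_params(activations)

def match(target, weights=(1.0 / 150.0, 1.0 / 0.4)):
    # Scale each parameter so f0 (Hz) and open quotient are comparable.
    d = (codebook - target) * np.asarray(weights)
    return activations[np.argmin((d ** 2).sum(axis=1))]

print("activations matching f0=180 Hz, OQ=0.6:", match(np.array([180.0, 0.6])))
```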

Hybrid parametric-physiological glottal modelling with application to voice quality assessment

A glottal model based on physical constraints is proposed. The model describes the vocal fold as a simple oscillator, i.e. a damped mass-spring system. The oscillator is coupled with a nonlinear block, accounting for fold interaction with the airflow. The nonlinear block is modelled as a regressor-based functional with weights to be identified, and a pitch-synchronous identification procedure is outlined. The model is used to analyse voiced sounds from normal and from pathological voices, and the application of the proposed analysis procedure to voice quality assessment is discussed.
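
The following sketch illustrates the overall structure only, not the paper's identified model: a mass-spring oscillator in a loop with a nonlinear block, here a van der Pol-style term chosen so the sketch self-oscillates; all constants are arbitrary.

```python
# Illustrative simulation of the hybrid structure: a damped mass-spring
# oscillator (the fold) in feedback with a nonlinear block standing in
# for the flow-fold interaction.
import numpy as np

fs = 16000.0                 # sampling rate (Hz)
m, k = 1.0, 4.0e5            # mass and stiffness: ~100 Hz natural frequency
x, v = 0.01, 0.0             # initial fold displacement and velocity
dt = 1.0 / fs
out = np.empty(3200)
for n in range(out.size):
    # Nonlinear block (van der Pol-style stand-in): injects energy at small
    # amplitude and dissipates it at large amplitude, so the loop settles
    # into self-sustained oscillation. In the paper this block is a
    # regressor-based functional identified from data, not this formula.
    f_nl = 600.0 * (1.0 - (x / 0.05) ** 2) * v
    a = (f_nl - k * x) / m
    v += a * dt              # semi-implicit Euler keeps the loop stable
    x += v * dt
    out[n] = x

# Rough f0 estimate from rising zero crossings of the steady part.
steady = out[1600:]
zc = np.where((steady[:-1] < 0) & (steady[1:] >= 0))[0]
print("approx. f0 (Hz):", fs / np.diff(zc).mean())
```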

MODEL-BASED SYNTHESIS AND TRANSFORMATION OF VOICED SOUNDS

In this work a glottal model loosely based on the Ishizaka and Flanagan model is proposed, in which the number of parameters is drastically reduced. First, the glottal excitation waveform is estimated, together with the vocal tract filter parameters, using inverse filtering techniques. Then the estimated waveform is used to identify the nonlinear glottal model, represented by a closed-loop configuration of two blocks: a second-order resonant filter, tuned with respect to the signal pitch, and a regressor-based functional, whose coefficients are estimated via nonlinear identification techniques. The results show that an accurate identification of real data can be achieved with fewer than 10 regressors of the nonlinear functional, and that an intuitive control of fundamental features, such as pitch and intensity, is allowed by acting on the physically informed parameters of the model.
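
Because the nonlinear block is a regressor-based functional, its coefficients enter linearly and can be fit by ordinary least squares once a regressor set is chosen. The sketch below shows that identification step on synthetic data, with polynomial regressors as a stand-in for whatever regressor set the paper actually uses.

```python
# Sketch of the identification step: with the second-order resonant filter
# fixed by the estimated pitch, the nonlinear block's weights enter
# linearly, so they can be fit by least squares on a chosen regressor set.
import numpy as np

rng = np.random.default_rng(3)
u = rng.uniform(-1, 1, 500)                 # stand-in filter output samples
target = 0.8 * u - 0.5 * u ** 3 + 0.01 * rng.standard_normal(u.size)

# Regressor matrix: columns are candidate basis functions of the input.
Phi = np.column_stack([u, u ** 2, u ** 3, u ** 5])
w, *_ = np.linalg.lstsq(Phi, target, rcond=None)
print("identified weights:", w)             # approx. [0.8, 0, -0.5, 0]
```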

NON-MODAL VOICE SYNTHESIS BY LOW-DIMENSIONAL PHYSICAL MODELS

The synthesis of different voice qualities by means of a low-dimensional glottal model is discussed. The glottal model is based on a one-mass model provided with a number of enhancements that make it suitable for the aims of the study. The simulation of modal and non-modal phonatory regimes is discussed. Both symmetric and nonsymmetric configurations are explored. The class of models under consideration is shown to be able to reproduce a broad range of phonation styles and to provide interesting control properties.

SYNTHESIS OF THE VOICE SOURCE USING A PHYSICALLY-INFORMED MODEL OF THE GLOTTIS

A physically-informed glottal model is proposed; some physical information is retained in a linear block that accounts for fold mechanics, while non-linear coupling with the airflow is modeled using a regressor-based mapping. The model is used in an identification/resynthesis scheme. Given a real signal, system parameters are estimated via non-linear identification techniques; then the model is used for resynthesizing the signal. With a proper choice of the regressor set the system accurately fits the target waveform and is stable during resynthesis. Physical parameters can be used to change voice quality and speaker identity.

Orthogonal least squares algorithm for the approximation of a map and its derivatives with a RBF network

Signal Processing, 2003

Radial basis function networks (RBFNs) are used primarily to solve curve-fitting problems and for non-linear system modeling. Several algorithms are known for the approximation of a non-linear curve from a sparse data set by means of RBFNs. Regularization techniques allow constraints on the smoothness of the curve to be defined by using the gradient of the function in the training. However, procedures that permit the value of the derivatives at the data points to be set arbitrarily are rarely found in the literature. In this paper, the orthogonal least squares (OLS) algorithm for the identification of RBFNs is modified to provide the approximation of a non-linear single-input single-output map along with its derivatives, given a set of training data. The interest in the derivatives of non-linear functions concerns many identification and control tasks where the study of system stability and robustness is addressed. The effectiveness of the proposed algorithm is demonstrated with examples in the field of data interpolation and control of non-linear dynamical systems.
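
The core trick can be sketched without the OLS subset selection: a Gaussian RBF expansion is linear in its weights for both the map and its derivative, so value rows and derivative rows can be stacked in one least-squares system. The target function, centers, and width below are assumptions chosen only for illustration.

```python
# Hedged sketch of the core idea (without the OLS unit selection): stack
# value and derivative targets in a single linear least-squares system
# over Gaussian RBF weights.
import numpy as np

x = np.linspace(-1, 1, 20)
y = np.sin(np.pi * x)                 # function samples
dy = np.pi * np.cos(np.pi * x)        # prescribed derivative samples

centers = np.linspace(-1, 1, 10)
width = 0.3

def phi(x):
    # Gaussian basis matrix: one column per center.
    return np.exp(-((x[:, None] - centers) ** 2) / (2 * width ** 2))

def dphi(x):
    # Analytic derivative of each Gaussian basis function w.r.t. x.
    return phi(x) * (-(x[:, None] - centers) / width ** 2)

A = np.vstack([phi(x), dphi(x)])      # stacked value + derivative rows
b = np.concatenate([y, dy])
w, *_ = np.linalg.lstsq(A, b, rcond=None)

x_test = np.linspace(-1, 1, 5)
print("fit values:", phi(x_test) @ w)
print("true values:", np.sin(np.pi * x_test))
```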

Learning pseudo-physical models for sound synthesis and transformation

Synthesis by physical models is a sound synthesis technique which has recently become popular due to sound quality and expressiveness of control. Usually, the synthesis algorithm is composed by connecting delay lines, filters and nonlinear maps, arranged in the typical exciter-resonator interaction structure. We propose a rather general structure based on an interaction scheme where the nonlinear component is modeled by Radial Basis Function Networks. This leads to a system which has the ability to learn the shape of the nonlinearity in order to reproduce a target sound. From the waveform data it is possible to deduce a training set for off-line learning techniques, and the parameters of the Radial Basis Function Network are computed by iterated selection of the radial units. In this work we start by considering memoryless nonlinear exciters. Thereafter, dynamic exciters are simulated by adopting a Nonlinear ARMA model. Once the system has converged to a well-behaved instrument model, it is possible to control sound features, such as pitch, by modifying the physically-informed parameters in an intuitive way.

Orthogonal Least Squares Algorithm for the Approximation of a Map and its Derivatives with a RBF Network

Computing Research Repository, 2000

Radial Basis Function Networks (RBFNs) are used primarily to solve curve-fitting problems and for non-linear system modeling. Several algorithms are known for the approximation of a non-linear curve from a sparse data set by means of RBFNs. However, there are no procedures that permit constraints on the derivatives of the curve to be defined. In this paper, the Orthogonal Least Squares algorithm for the identification of RBFNs is modified to provide the approximation of a non-linear 1-in 1-out map along with its derivatives, given a set of training data. The interest in the derivatives of non-linear functions concerns many identification and control tasks where the study of system stability and robustness is addressed. The effectiveness of the proposed algorithm is demonstrated by a study on the stability of a single loop feedback system.

Coproduction of speech and emotion: bi-modal audio-visual changes of consonant and vowel labial targets

This paper concerns the bimodal transmission of emotive speech and describes how the expression of joy, surprise, sadness, disgust, anger, and fear leads to visual and acoustic target modifications in some Italian phonemes. Current knowledge on the audio-visual transmission of emotive speech traditionally concerns global prosodic and intonational characteristics of speech and facial configurations. In this research we intend to integrate this approach with the analysis of the interaction between labial configurations, peculiar to each emotion, and the articulatory lip movements defined by phonetic-phonological rules, specific to the vowels and consonants /'a/, /b/, /v/ ([1], [2]). Moreover, we present the correlations between articulatory data and the spectral features of the co-produced acoustic signal.

INTERFACE: a new tool for building emotive/expressive talking heads

In order to speed up the procedure for building an emotive/expressive talking head such as LUCIA, an integrated software package called INTERFACE was designed and implemented in Matlab©. INTERFACE simplifies and automates many of the operations needed for that purpose. A set of processing tools, focusing mainly on dynamic articulatory data physically extracted by an automatic optotracking 3D movement analyzer, was implemented in order to build up the animation engine, which is based on the Cohen-Massaro coarticulation model, and also to create the correct WAV and FAP files needed for the animation. LUCIA, our animated MPEG-4 talking face, can in fact copy a real human by reproducing the movements of some markers positioned on his face and recorded by an optoelectronic device, or can be directly driven by an emotional XML-tagged input text, thus realizing a true audio-visual emotive/expressive synthesis. LUCIA's voice is based on an Italian version of the FESTIVAL-MBROLA packages, modified for expressive/emotive synthesis by means of an appropriate APML/VSML tagged language.

Emotions and Voice Quality: Experiments with Sinusoidal Modeling

Voice quality is recognized to play an important role in the rendering of emotions in verbal communication. In this paper we explore the effectiveness of a sinusoidal modeling processing framework for voice transformations aimed at the analysis and synthesis of emotive speech. A set of acoustic cues is selected to compare the voice quality characteristics of the speech signals on a voice corpus in which different emotions are reproduced. The sinusoidal signal processing tool is used to convert a neutral utterance into emotive utterances. Two different procedures are applied and compared: in the first one, only the alignment of phoneme duration and of pitch contour is performed; the second procedure refines the transformations by using a spectral conversion function. This refinement improves the reproduction of the different voice qualities of the target emotive utterances. The acoustic cues extracted from the transformed utterances are compared to those of the original emotive utterances, and the properties and quality of the transformation method are discussed.

PROSODIC DATA DRIVEN MODELLING OF A NARRATIVE STYLE IN FESTIVAL TTS

A general data-driven procedure for creating new prosodic modules for the Italian FESTIVAL Text-To-Speech (TTS) [1] synthesizer is described. These modules are based on "Classification and Regression Trees" (CART) theory. The prosodic factors taken into consideration are duration, pitch and loudness. Loudness control has been implemented as an extension to the MBROLA diphone concatenative synthesizer. The prosodic models were trained using two speech corpora with different speaking styles, and the effectiveness of the CART-based prosody was assessed with a set of evaluation tests.
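
For flavor, a toy version of the duration module: a regression tree predicts phone duration from a few linguistic features. The features, the synthetic data, and the use of sklearn's DecisionTreeRegressor in place of the actual CART implementation are all assumptions.

```python
# Toy sketch of a CART-style duration module: a regression tree predicts
# a phone's duration (ms) from simple linguistic features.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(4)
n = 400
is_vowel = rng.integers(0, 2, n)        # 1 = vowel, 0 = consonant
stressed = rng.integers(0, 2, n)
phrase_final = rng.integers(0, 2, n)

# Synthetic durations: vowels longer; stress and finality lengthen.
dur = 60 + 40 * is_vowel + 20 * stressed + 35 * phrase_final \
      + rng.normal(0, 8, n)

X = np.column_stack([is_vowel, stressed, phrase_final])
tree = DecisionTreeRegressor(max_depth=3).fit(X, dur)

# Predict the duration of a stressed, phrase-final vowel.
print("predicted duration (ms):", tree.predict([[1, 1, 1]])[0])
```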

SMS-FESTIVAL: a New TTS Framework

A new sinusoidal-model-based engine for the FESTIVAL TTS system is described. The engine performs the DSP (Digital Signal Processing) operations of a diphone-based concatenative TTS system, i.e. converting a phonetic input into an audio signal, taking as input the NLP (Natural Language Processing) data computed by FESTIVAL: a sequence of phonemes with length and intonation values elaborated from the text script.

INTERFACE Toolkit: A New Tool for Building IVAs

INTERFACE is an integrated software package implemented in Matlab© and created to speed up the procedure for building an emotive/expressive talking head. Various processing tools, working on dynamic articulatory data physically extracted by an optotracking 3D movement analyzer called ELITE, were implemented to build the animation engine and also to create the correct WAV and FAP files needed for the animation. By the use of INTERFACE, LUCIA, our animated MPEG-4 talking face, can copy a real human by reproducing the movements of passive markers positioned on his face and recorded by an optoelectronic device, or can be directly driven by an emotional XML-tagged input text, thus realizing a true audio/visual emotive/expressive synthesis. LUCIA's voice is based on an Italian version of the FESTIVAL-MBROLA packages, modified for expressive/emotive synthesis by means of an appropriate APML/VSML tagged language.

Modifications of phonetic labial targets in emotive speech: effects of the co-production of speech and emotions

Speech Communication, 2004

This paper describes how the visual and acoustic characteristics of some Italian phones (/'a/, /b/, /v/) are modified in emotive speech by the expression of joy, surprise, sadness, disgust, anger, and fear. In this research we specifically analyze the interaction between labial configurations, peculiar to each emotion, and the articulatory lip movements of the Italian vowel /'a/ and consonants /b/ and /v/, defined by phonetic-phonological rules. This interaction was quantified by examining the variations of the following parameters: lip opening, upper and lower lip vertical displacements, lip rounding, anterior/posterior movements (protrusion) of the upper lip and lower lip, left and right lip corner horizontal displacements, left and right corner vertical displacements, and asymmetry parameters calculated as the difference between right and left corner position along the horizontal and vertical axes.
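
Most of these parameters reduce to simple arithmetic on 3D lip-marker coordinates. The sketch below computes a few of them from hypothetical marker positions; the marker names, axis convention, and values are made up for illustration.

```python
# Sketch of articulatory lip parameters as arithmetic on 3D marker
# coordinates (x = horizontal, y = vertical, z = anterior/posterior).
import numpy as np

markers = {  # hypothetical positions in mm, facial midline at x = 0
    "upper_lip": np.array([0.0, 12.0, 4.0]),
    "lower_lip": np.array([0.0, -6.0, 3.0]),
    "left_corner": np.array([-22.0, 1.0, 0.0]),
    "right_corner": np.array([24.0, 2.0, 0.0]),
}

lip_opening = markers["upper_lip"][1] - markers["lower_lip"][1]
lip_width = markers["right_corner"][0] - markers["left_corner"][0]
upper_protrusion = markers["upper_lip"][2]
# Horizontal asymmetry: with the midline at x = 0, the signed offset of
# the corner pair reduces to the sum of the two x coordinates.
asym_horizontal = markers["right_corner"][0] + markers["left_corner"][0]
asym_vertical = markers["right_corner"][1] - markers["left_corner"][1]

print(f"opening={lip_opening} mm, width={lip_width} mm, "
      f"asym (h, v)=({asym_horizontal}, {asym_vertical}) mm")
```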

CONTROL OF VOICE QUALITY FOR EMOTIONAL SPEECH SYNTHESIS

Speech production in general, and emotional speech in particular, is characterized by a wide variety of phonation modalities. Voice quality, which is the term commonly used in the field, has an important role in the communication of emotions through speech, and nonmodal phonation modalities (soft, breathy, whispery, creaky, for example) are commonly found in emotional speech corpora. In this paper, we describe a voice synthesis framework that allows control of a set of acoustic parameters relevant for the simulation of nonmodal voice qualities. The set of controls of the synthesizer includes standard controls for duration and pitch of the phonemes, and additional controls for intensity, spectral emphasis, fast and slow variations of the duration and amplitude of the waveform periods (for voiced frames), frequency-axis warping for changing the formant positions, and aspiration noise level. Some guidelines are given for combining these signal transformations with the aim of reproducing some nonmodal voice qualities, including soft, loud, breathy, whispery, hoarse, and tremulous voice. It is also discussed how these voice qualities characterize emotional speech. The system described here is based on the FESTIVAL speech synthesis framework and on the MBROLA diphone concatenation acoustic back-end. We also address the possibility of including affective tags in the input text to be converted. To this aim, FESTIVAL was provided with support for affective tags through ad-hoc mark-up languages (APML/VSML), and for driving the extended MBROLA synthesis engine through the generation of voice quality controls. The control of the acoustic characteristics of the voice signal is based on signal processing routines applied to the diphones before the concatenation step. Time-domain algorithms are used for the cues related to pitch control, whereas frequency-domain algorithms, based on FFT and inverse FFT, are used for the cues related to the short-term spectral envelope of the signal.
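
As an example of one control from this list, the sketch below applies a frequency-axis warp to a single frame's magnitude spectrum via FFT and inverse FFT, which shifts formant positions upward. The linear warp factor and the unmodified phase are simplifying assumptions, not the system's actual algorithm.

```python
# Hedged sketch of frequency-axis warping on one analysis frame: resample
# the magnitude spectrum along the frequency axis, keep the phase, and
# return to the time domain.
import numpy as np

fs = 16000
frame = np.hanning(512) * np.random.default_rng(5).standard_normal(512)

spectrum = np.fft.rfft(frame)
mag, phase = np.abs(spectrum), np.angle(spectrum)

warp = 1.1                       # > 1 moves spectral features up in frequency
bins = np.arange(mag.size)
# Read the magnitude envelope at warped positions (linear interpolation).
warped_mag = np.interp(bins / warp, bins, mag, left=0.0, right=0.0)

out = np.fft.irfft(warped_mag * np.exp(1j * phase), n=frame.size)
print("output frame RMS:", np.sqrt(np.mean(out ** 2)))
```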
