João Cabral - Academia.edu
Papers by João Cabral
The control over aspects of the glottal source signal is fundamental to correctly modify relevant voice characteristics, such as breathiness. This voice quality is strongly related to the characteristics of the glottal source signal produced at the glottis, mainly the shape of the glottal pulse and the aspiration noise. This type of noise results from the turbulence of air passing through the glottis and can be represented by amplitude-modulated Gaussian noise, which depends on the glottal volume velocity and the glottal area. However, the dependency between the glottal signal and the noise component is usually not taken into account when transforming breathiness. In this paper, we propose a method for modelling the aspiration noise which adapts the noise to the shape of the glottal pulse while producing high-quality speech. The envelope of the amplitude-modulated noise is estimated from the speech signal pitch-synchronously and is then parameterized using a non-linear polynomial fitting algorithm. Finally, an asymmetric triangular window is derived from the non-linear polynomial representation so that the energy envelope of the noise follows a shape closer to that of the glottal source. In the voice transformation experiments, both the proposed aspiration noise model and an acoustic glottal source model are used to transform a modal voice into a breathy voice. Results show that the aspiration noise model improves the voice quality transformation compared with an excitation using only the glottal model and an excitation that combines the glottal source model with a spectral representation of the noise component.
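As an illustration of the amplitude-modulation idea described in this abstract, the following Python sketch shapes Gaussian noise pitch-synchronously with an asymmetric triangular envelope. It is a minimal sketch, not the paper's implementation: the `pitch_marks`, `peak_ratio` and `gain` parameters are hypothetical, and the polynomial fitting step is omitted.

```python
import numpy as np

def asymmetric_triangular_window(length, peak_ratio):
    """Asymmetric triangular envelope: rises until peak_ratio * length, then falls."""
    peak = max(1, int(peak_ratio * length))
    rise = np.linspace(0.0, 1.0, peak, endpoint=False)
    fall = np.linspace(1.0, 0.0, length - peak)
    return np.concatenate([rise, fall])

def modulated_aspiration_noise(pitch_marks, signal_length, peak_ratio=0.3, gain=0.05):
    """Amplitude-modulated Gaussian noise, shaped period by period (illustrative values)."""
    noise = gain * np.random.randn(signal_length)
    envelope = np.zeros(signal_length)
    for start, end in zip(pitch_marks[:-1], pitch_marks[1:]):
        envelope[start:end] = asymmetric_triangular_window(end - start, peak_ratio)
    return envelope * noise
```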
A major factor causing a deterioration in speech quality in HMM-based speech synthesis is the use of a simple delta pulse signal to generate the excitation of voiced speech. This paper sets out a new approach that uses an acoustic glottal source model in HMM-based synthesisers instead of the traditional pulse signal. The goal is to improve speech quality and to better model and transform voice characteristics. We have found that the new method decreases buzziness and also improves prosodic modelling. A perceptual evaluation supported this finding by showing a 55.6% preference for the new system over the baseline. This improvement, while not as significant as we had initially expected, encourages us to develop the proposed speech synthesiser further.
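To make the contrast concrete, the sketch below builds the two kinds of voiced excitation the abstract mentions: a baseline delta pulse train and an excitation made of glottal pulses. The abstract does not name the acoustic glottal source model used in the paper; the Rosenberg-style pulse here is only a simple stand-in, and all parameter values are assumptions.

```python
import numpy as np

def impulse_train(f0, fs, duration):
    """Baseline excitation: one delta pulse per pitch period."""
    n = int(duration * fs)
    period = int(fs / f0)
    exc = np.zeros(n)
    exc[::period] = 1.0
    return exc

def rosenberg_pulse(period, open_quotient=0.6, asymmetry=0.7):
    """One glottal flow pulse (Rosenberg-style): rising and falling cosine segments."""
    n_open = int(open_quotient * period)
    n_rise = int(asymmetry * n_open)
    n_fall = n_open - n_rise
    rise = 0.5 * (1 - np.cos(np.pi * np.arange(n_rise) / n_rise))
    fall = np.cos(0.5 * np.pi * np.arange(n_fall) / n_fall)
    return np.concatenate([rise, fall, np.zeros(period - n_open)])

def glottal_excitation(f0, fs, duration, **pulse_kwargs):
    """Excitation built by concatenating glottal pulses instead of delta pulses."""
    period = int(fs / f0)
    n_pulses = int(duration * fs) // period
    return np.tile(rosenberg_pulse(period, **pulse_kwargs), n_pulses)
```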
According to the source-filter model of speech production, speech can be represented by passing an excitation signal through the vocal tract filter. The epoch, or instant of maximum excitation, corresponds to the glottal closure instant. Several speech processing applications require robust epoch detection, but this can be a difficult task. Although state-of-the-art epoch estimation methods can produce reliable results, they are generally evaluated using speech recorded with a neutral voice quality (modal voice). This paper reviews and evaluates six popular algorithms for the detection of glottal closure instants on speech spoken with modal voice and seven additional voice qualities. Results show that the performance of each method is affected by the voice type and that some methods perform better than others for each voice quality.
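Evaluations of this kind typically score each algorithm by how many reference glottal closure instants are matched by exactly one detection inside a small tolerance window. The sketch below computes such an identification rate; the abstract does not state which metrics or tolerance the paper uses, so the 0.25 ms tolerance and the metric itself are assumptions.

```python
def gci_identification_rate(reference, detected, fs=16000, tolerance=0.00025):
    """Fraction of reference GCIs (sample indices) matched by exactly one detection
    within +/- tolerance seconds; an assumed, commonly used style of GCI metric."""
    tol = int(tolerance * fs)
    hits = 0
    for r in reference:
        matches = [d for d in detected if abs(d - r) <= tol]
        if len(matches) == 1:
            hits += 1
    return hits / len(reference) if reference else 0.0
```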
This paper describes a prototype of a computer-assisted pronunciation training system called MySpeech. The interface of the MySpeech system is web-based, and it currently enables users to practise pronunciation by listening to speech spoken by native speakers and tuning their own speech production to correct any mispronunciations detected by the system. This practice exercise is offered across different topics and difficulty levels. An experiment was conducted in this work that combines the MySpeech service with the WebWOZ Wizard-of-Oz platform (http://www.webwoz.com), in order to improve the human-computer interaction (HCI) of the service and the feedback that it provides to the user. The Wizard-of-Oz method enables a human (who acts as a wizard) to give feedback to the practising user, while the user is unaware that another person is involved in the communication. This experiment made it possible to quickly test an HCI model before its implementation in the MySpeech system. It also allowed input data to be collected from the wizard that can be used to improve the proposed model. Another outcome of the experiment was a preliminary evaluation of the pronunciation learning service in terms of user satisfaction, which would be difficult to conduct before integrating the HCI part.
Generating emotions in speech is currently an important topic of research, given the requirement for modern human-machine interaction systems to produce expressive speech. We present the EmoVoice system, which implements acoustic rules to simulate seven basic emotions in neutral speech. It uses pitch-synchronous time-scaling (PSTS) of the excitation signal to change the prosody and the glottal source parameters most relevant to voice quality. The system also transforms other parameters of the vocal source signal to produce different types of irregular voicing. The correlation of the speech parameters with the basic emotions was derived from measurements of the glottal parameters and from results reported by other authors. The evaluation of the system showed that it can generate recognizable emotions, but improvements are still necessary to discriminate some pairs of emotions.
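A rule-based system of this kind can be pictured as a table that maps each emotion to scaling factors for prosodic parameters. The sketch below is purely illustrative: the emotions listed, the parameters chosen and all numeric values are assumptions, not the rules derived in the paper.

```python
import numpy as np

# Hypothetical rule table: each emotion maps to scaling factors for mean F0,
# F0 range and speech rate (illustrative values, not taken from the paper).
EMOTION_RULES = {
    "anger":   {"f0_mean": 1.3, "f0_range": 1.4, "rate": 1.2},
    "joy":     {"f0_mean": 1.2, "f0_range": 1.3, "rate": 1.1},
    "sadness": {"f0_mean": 0.9, "f0_range": 0.7, "rate": 0.8},
    "fear":    {"f0_mean": 1.3, "f0_range": 1.1, "rate": 1.3},
}

def apply_emotion(f0_contour, durations, emotion):
    """Scale an F0 contour (NumPy array, Hz) and phone durations by the rule table."""
    rule = EMOTION_RULES[emotion]
    f0_mean = f0_contour.mean()
    new_f0 = f0_mean * rule["f0_mean"] + (f0_contour - f0_mean) * rule["f0_range"]
    new_durations = [d / rule["rate"] for d in durations]
    return new_f0, new_durations

# Example: raise and widen the pitch contour and speed up the phones for "anger".
f0, durs = apply_emotion(np.array([110.0, 120.0, 115.0]), [0.08, 0.12, 0.10], "anger")
```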
Current time-domain pitch modification techniques have well-known limitations for large variations of the original fundamental frequency. This paper proposes a technique for changing the pitch and duration of a speech signal based on time-scaling the linear prediction (LP) residual. The resulting speech signal achieves better quality than the traditional LP-PSOLA method for large fundamental frequency modifications. By using non-uniform time-scaling, this technique can also change the shape of the LP residual within each pitch period. In this way we can simulate changes of the most relevant glottal source parameters, such as the open quotient, the spectral tilt and the asymmetry coefficient. Careful adjustment of these source parameters allows the original speech signal to be transformed so that it is perceived as if it had been uttered with a different voice quality or emotion.
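The core idea of time-scaling the LP residual can be sketched as follows: each pitch period of the residual is resampled to a new length and the result is passed back through the original all-pole filter. This is a minimal sketch under simplifying assumptions (uniform time-scaling only, a single LPC filter for the whole segment, hypothetical `pitch_marks`), not the non-uniform method proposed in the paper.

```python
import numpy as np
from scipy.signal import lfilter, resample

def scale_residual_pitch(residual, lpc_coeffs, pitch_marks, factor):
    """Raise F0 by `factor` by shortening each pitch period of the LP residual,
    then re-synthesise with the original all-pole vocal tract filter 1/A(z)."""
    segments = []
    for start, end in zip(pitch_marks[:-1], pitch_marks[1:]):
        period = residual[start:end]
        new_len = max(2, int(round(len(period) / factor)))  # shorter period -> higher F0
        segments.append(resample(period, new_len))
    new_residual = np.concatenate(segments)
    return lfilter([1.0], lpc_coeffs, new_residual)
```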
Emerging applications of synthetic characters, as a way to achieve more natural interactions, put new demands on synthetic voices in order to fulfill the expectations of the user. The work presented in this paper evaluates a synthetic voice used by a synthetic character in a storytelling situation. To allow for a better comparison, a real actor was filmed telling a children's story. The pitch, duration and energy of the recorded speech were copied to the synthetic speech generated with a FESTIVAL-based LPC diphone synthesizer. At the same time, the synthetic character was animated with the gestures, emotions and facial expressions used by the actor. Using different conditions combining the synthetic voice and synthetic character with the real voice and the real character, the voice was evaluated with regard to comprehension of the storyteller, the expression of emotions, its credibility and user satisfaction.