Masami Akamine | Toshiba - Academia.edu
Papers by Masami Akamine
2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 2013
The Journal of the Acoustical Society of America, 1998
The Journal of the Acoustical Society of America, 1994
IEEE Journal of Selected Topics in Signal Processing, 2000
ABSTRACT The statistical models of hidden Markov model based text-to-speech (HMM-TTS) systems are typically built using homogeneous data. It is possible to acquire data from many different sources, but combining them leads to a non-homogeneous or diverse dataset. This paper describes the application of average voice models (AVMs) and a novel application of cluster adaptive training (CAT) with multiple context-dependent decision trees to create HMM-TTS voices using diverse data: speech data recorded in studios mixed with speech data obtained from the internet. Training AVM and CAT models on diverse data yields better quality speech than training on high-quality studio data alone. Tests show that CAT is able to create a voice for a target speaker with as little as 7 seconds of adaptation data; an AVM would need more data to reach the same level of similarity to the target speaker. Tests also show that CAT produces higher-quality voices than AVMs irrespective of the amount of adaptation data. Lastly, it is shown that it is beneficial to model the data using multiple context-clustering decision trees.
Computer Speech & Language, 2013
ABSTRACT This paper presents a study on the importance of short-term speech parameterizations for expressive statistical parametric synthesis. Assuming a source-filter model of speech production, the analysis is conducted over spectral parameters, here defined as features which represent a minimum-phase synthesis filter, and some excitation parameters, which are features used to construct a signal that is fed to the minimum-phase synthesis filter to generate speech. In the first part, different spectral and excitation parameters that are applicable to statistical parametric synthesis are tested to determine which ones are the most emotion dependent. The analysis is performed through two methods proposed to measure the relative emotion dependency of each feature: one based on K-means clustering, and another based on Gaussian mixture modeling for emotion identification. Two commonly used forms of parameters for the short-term speech spectral envelope, the Mel cepstrum and the Mel line spectrum pairs, are utilized. As excitation parameters, the anti-causal cepstrum, the time-smoothed group delay, and band-aperiodicity coefficients are considered. According to the analysis, the line spectral pairs are the most emotion dependent parameters. Among the excitation features, the band-aperiodicity coefficients present the highest correlation with the speaker's emotion. The most emotion dependent parameters according to this analysis were selected to train an expressive statistical parametric synthesizer using a speaker and language factorization framework. Subjective test results indicate that the considered spectral parameters have a greater impact on the synthesized speech emotion than the excitation ones.
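As an illustration of the second analysis method, here is a minimal sketch of measuring a feature's emotion dependency via per-emotion classification accuracy. It uses single 1-D Gaussians instead of the paper's Gaussian mixture models, and all feature values and emotion labels are hypothetical:

```python
import math
import random

def gaussian_logpdf(x, mean, var):
    # Log-density of a 1-D Gaussian.
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def fit_gaussian(samples):
    # Maximum-likelihood mean and variance (variance floored for stability).
    n = len(samples)
    mean = sum(samples) / n
    var = sum((s - mean) ** 2 for s in samples) / n
    return mean, max(var, 1e-6)

def emotion_dependency(feature_by_emotion):
    """Classification accuracy of a per-emotion Gaussian classifier.

    Higher accuracy means the feature separates emotions better, i.e. it is
    more emotion dependent (a stand-in for the paper's GMM-based analysis)."""
    models = {emo: fit_gaussian(vals) for emo, vals in feature_by_emotion.items()}
    correct = total = 0
    for emo, vals in feature_by_emotion.items():
        for v in vals:
            pred = max(models, key=lambda e: gaussian_logpdf(v, *models[e]))
            correct += pred == emo
            total += 1
    return correct / total

random.seed(0)
# Hypothetical 1-D feature values for two emotions:
# one well-separated feature versus one heavily overlapping feature.
separable = {"neutral": [random.gauss(0.0, 0.5) for _ in range(200)],
             "angry":   [random.gauss(3.0, 0.5) for _ in range(200)]}
overlapping = {"neutral": [random.gauss(0.0, 1.0) for _ in range(200)],
               "angry":   [random.gauss(0.2, 1.0) for _ in range(200)]}
print(emotion_dependency(separable) > emotion_dependency(overlapping))  # True
```

The separable feature scores near-perfect accuracy while the overlapping one stays close to chance, mirroring how the paper ranks parameters such as line spectral pairs above less emotion-dependent features.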
Systems and Computers in Japan, 2007
2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2011
This paper presents a rapid voice adaptation algorithm using GMM-based frequency warping and shift with parameters of a subband basis spectrum model (SBM) [1]. The SBM parameter represents the shape of a speech spectrum; it is calculated by fitting sub-band bases to the log-spectrum. Since the parameter is a frequency-domain representation, frequency warping can be applied directly to it. A frequency warping function that minimizes the distance between source and target SBM parameter pairs in each mixture component of a GMM is derived using a dynamic programming (DP) algorithm. The proposed method is evaluated in a unit-selection-based voice adaptation framework applied to a unit-fusion-based text-to-speech synthesizer. The experimental results show that the proposed adaptation method is effective for rapid voice adaptation using just one sentence, compared to the conventional GMM-based linear transformation of mel-cepstra.
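The DP search for a warping function can be sketched as a monotonic alignment over frequency bins, much like dynamic time warping applied along the frequency axis. This is a simplified illustration with a plain squared-error cost; the paper's actual cost function, path constraints, and per-mixture-component handling may differ, and the spectra below are made up:

```python
def dp_frequency_warp(source, target):
    """Monotonic frequency-bin alignment minimizing accumulated squared
    distance between source and target spectral parameters, found by DP.
    Returns the warping path as (source_bin, target_bin) pairs."""
    n, m = len(source), len(target)
    INF = float("inf")
    cost = [[INF] * m for _ in range(n)]
    back = [[None] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = (source[i] - target[j]) ** 2
            if i == 0 and j == 0:
                cost[i][j] = d
                continue
            best, arg = INF, None
            # Allowed predecessors keep the warp monotonic.
            for pi, pj in ((i - 1, j), (i, j - 1), (i - 1, j - 1)):
                if pi >= 0 and pj >= 0 and cost[pi][pj] < best:
                    best, arg = cost[pi][pj], (pi, pj)
            cost[i][j] = d + best
            back[i][j] = arg
    # Backtrack from the end of both spectra.
    path, node = [], (n - 1, m - 1)
    while node is not None:
        path.append(node)
        node = back[node[0]][node[1]]
    return path[::-1]

# A toy source spectrum whose peak sits one bin below the target's:
src = [0.0, 1.0, 4.0, 1.0, 0.0, 0.0]
tgt = [0.0, 0.0, 1.0, 4.0, 1.0, 0.0]
print(dp_frequency_warp(src, tgt))
# → [(0, 0), (0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 5)]
```

The recovered path maps source bin 2 to target bin 3, i.e. it finds the one-bin frequency shift that best aligns the two spectra.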
Systems and Computers in Japan, 1999
Electronics and Communications in Japan (Part III: Fundamental Electronic Science), 2000
This paper presents a variable bit rate ADP-CELP (Adaptive Density Pulse Code Excited Linear Prediction) coder that selects one of four coding structures in each frame based on short-time speech characteristics. To improve speech quality and reduce the average bit rate, we have developed a speech/non-speech classification method using spectrum envelope variation, which is robust to background noise. In addition, we propose an efficient pitch lag coding technique. The technique interpolates consecutive frame pitch lags and quantizes a vector of relative pitch lags, consisting of the variation between an estimated pitch lag and a target pitch lag in plural subframes. The average bit rate of the proposed coder was approximately 2.4 kbps for speech sources with an activity factor of 60%. Our subjective testing indicates that the quality of the proposed coder exceeds that of the Japanese digital cellular standard at a rate of 3.45 kbps.
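The relative pitch-lag idea can be sketched in a few lines: interpolate between consecutive frame lags to get an estimate per subframe, then keep only the small deviations of the true (target) lags from that estimate, which form the vector the coder quantizes. This is a simplified illustration with linear interpolation and hypothetical lag values; the paper's estimator and quantizer details may differ:

```python
def relative_pitch_lags(prev_lag, curr_lag, target_lags):
    """Relative pitch-lag vector: per-subframe deviations of the target lags
    from lags linearly interpolated between consecutive frame lags.
    These small deviations are what gets vector-quantized."""
    n = len(target_lags)
    # Estimated lag at each subframe, interpolated from prev to curr.
    estimated = [prev_lag + (curr_lag - prev_lag) * (k + 1) / n for k in range(n)]
    return [t - e for t, e in zip(target_lags, estimated)]

# Hypothetical pitch lags in samples, 4 subframes per frame:
deltas = relative_pitch_lags(40.0, 48.0, [42.5, 44.0, 45.5, 48.0])
print(deltas)  # [0.5, 0.0, -0.5, 0.0]
```

Because the deviations have a much smaller range than the raw lags, they can be coded with far fewer bits.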
Proceedings - IEEE International Symposium on Circuits and Systems
Linguistic feature analysis of input text plays an important role in achieving natural prosodic control in text-to-speech (TTS) systems. In a conventional scheme, when input texts have been analyzed incorrectly, experts refine suspicious if-then rules and change the tree structure manually to obtain correct analysis results. However, altering the tree structure drastically is difficult, since attention is often paid only to the suspicious if-then rules.
The Toshiba English Text-to-Speech Synthesizer utilizes several new techniques to produce synthesized speech that is more natural-sounding and intelligible than that created by conventional synthesizers. The closed-loop training method creates synthesis units that most closely resemble the training data and are the least susceptible to prosodic distortion noise, by analytically solving an equation that minimizes distortion between target units and training data. The pitch contour model creates a codebook of representative word-based F0 contours by first clustering the training data using word stress and syllable counts. Within each cluster, the training data is divided into different groups using lexical and phonological attributes of each word. In each group, a representative contour is created using approximate error estimation. The resulting approximate errors are used in offset level prediction for each contour.
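The codebook-building step can be sketched as grouping word F0 contours by their prosodic attributes and taking a per-group representative. This simplification uses only (stress position, syllable count) as the grouping key and the group mean as the representative, standing in for the paper's finer lexical/phonological grouping and approximate error estimation; all contour values are invented:

```python
from collections import defaultdict

def build_contour_codebook(words):
    """Group word F0 contours by (stress position, syllable count) and use
    the per-group mean contour as the representative codebook entry
    (a simplification of the clustering described in the abstract)."""
    groups = defaultdict(list)
    for stress, syllables, contour in words:
        groups[(stress, syllables)].append(contour)
    codebook = {}
    for key, contours in groups.items():
        length = len(contours[0])
        # Point-wise mean over all contours in the group.
        codebook[key] = [sum(c[i] for c in contours) / len(contours)
                         for i in range(length)]
    return codebook

# Hypothetical entries: (stress position, syllable count, F0 contour in Hz).
words = [(1, 2, [120.0, 100.0]),
         (1, 2, [130.0, 110.0]),
         (2, 2, [100.0, 140.0])]
print(build_contour_codebook(words))
# → {(1, 2): [125.0, 105.0], (2, 2): [100.0, 140.0]}
```

At synthesis time a word's attributes select a codebook entry, and a predicted offset level then positions the contour in the speaker's F0 range.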
This paper presents recent developments at our site toward speech recognition using decision tree based acoustic models. Previously, robust decision trees have been shown to achieve better performance than standard Gaussian mixture model (GMM) acoustic models. This was achieved by converting the hard questions (decisions) of a standard tree into soft questions using a sigmoid function. In this paper, we report our work in which soft-decision trees are trained from scratch. These soft-decision trees are shown to yield better speech recognition accuracy than standard GMM acoustic models on the Aurora digit recognition task.
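The core idea of a soft question can be shown with a single tree node: where a hard tree asks "is x greater than the threshold?" and commits to one branch, the sigmoid version assigns a branch probability and blends the two leaf outputs. A minimal sketch with made-up leaf values and a hypothetical sharpness parameter (the trained trees in the paper have many such nodes):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def soft_tree_predict(x, threshold, left_value, right_value, sharpness=4.0):
    """One soft question: a sigmoid of the signed distance to the threshold
    gives the probability of taking the right branch, and the two leaf
    values are blended by that probability."""
    p_right = sigmoid(sharpness * (x - threshold))
    return p_right * right_value + (1.0 - p_right) * left_value

# Far from the threshold the soft node behaves like a hard one;
# at the threshold the output is the midpoint of the two leaves.
print(round(soft_tree_predict(5.0, 0.0, -1.0, 1.0), 3))  # 1.0
print(round(soft_tree_predict(0.0, 0.0, -1.0, 1.0), 3))  # 0.0
```

Because the sigmoid is differentiable, such trees can be trained from scratch with gradient methods, which is what distinguishes this work from merely softening an already-built hard tree.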