Daniel Tihelka - Academia.edu
Papers by Daniel Tihelka
Springer eBooks, 2009
In the present paper, several experiments on text-to-speech system personification are described. The personification enables the TTS system to produce new voices by employing voice conversion methods. The baseline speech synthesizer is a concatenative corpus-based TTS system which utilizes the unit selection method. The voice identity change is performed by transforming the spectral envelope, spectral detail and pitch. Two different personification approaches are compared in this paper. The former is based on the transformation of the original speech corpus; the latter transforms the output of the synthesizer. Specific advantages and disadvantages of both approaches are discussed and their performance is compared in listening tests.
Springer eBooks, Aug 18, 2007
This paper deals with the problem of speech waveform polarity. As the polarity of the speech waveform can influence the performance of pitch marking algorithms (see Sec. 4), a simple method for determining the polarity of the speech signal is presented in the paper. We call this problem peak/valley decision making, i.e. deciding whether pitch marks should be placed at peaks (local maxima) or at valleys (local minima) of a speech waveform. Besides, the proposed method can be utilized to check the polarity consistency of a speech corpus, which is important for the concatenation of speech units in speech synthesis.
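The peak/valley decision described above can be illustrated with a minimal sketch (this is not the paper's method, just one plausible heuristic): compare the magnitudes of the strongest positive and negative excursions of the waveform and place pitch marks on the dominant side.

```python
import numpy as np

def peak_valley_decision(x, n=10):
    # Illustrative heuristic, not the published algorithm: compare the
    # mean magnitude of the n strongest positive excursions against the
    # n strongest negative ones; pitch marks go on the dominant side.
    x = np.asarray(x, dtype=float)
    top = np.sort(x)[-n:]       # strongest positive excursions
    bottom = np.sort(x)[:n]     # strongest negative excursions
    return "peaks" if top.mean() > -bottom.mean() else "valleys"
```

For a corpus-consistency check in the spirit of the abstract, one could run this decision over every utterance and flag files whose result differs from the majority.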
Lecture Notes in Computer Science, Nov 11, 2022
Conference of the International Speech Communication Association, 2017
This paper focuses on building personalized text-to-speech (TTS) synthesis for people who are losing their voices due to fatal diseases. The special conditions of this task make the process different from, and more difficult than, preparing professional synthetic voices for commercial TTS systems. The whole process is described in this paper, and the first results of the personalized voice building are presented as well.
SSW, Aug 31, 2013
This paper presents a new analytic method that can be used for analyzing the perceptual relevance of unit selection costs and/or their sub-components, as well as for tuning unit selection weights. The proposed method is leveraged to investigate the behavior of a unit-selection-based system. The outcome is applied in a simple experiment aiming to improve the speech output quality of the system by setting limits on the costs and their sub-components during the search for optimal sequences of units. The experiments reveal that a large number (36.17%) of artifacts annotated by listeners are not reflected by the values of the costs and their sub-components as currently implemented and tuned in the evaluated system.
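The cost-limiting idea can be sketched as a pruning step during candidate selection (a minimal illustration, assuming per-unit dictionaries of sub-component costs; the names and data layout are not from the paper):

```python
def prune_candidates(candidates, limits):
    # Sketch of the cost-limiting idea: during unit selection, discard
    # candidate units whose individual cost sub-components exceed
    # perceptually derived limits. `candidates` maps unit id -> dict of
    # sub-component costs; `limits` holds the maxima (names illustrative).
    return [unit for unit, costs in candidates.items()
            if all(costs[name] <= limits[name] for name in limits)]
```

The surviving candidates would then enter the usual Viterbi search for the optimal unit sequence.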
We introduce a unified grapheme-to-phoneme (G2P) conversion framework based on the composition of deep neural networks. In contrast to the usual approaches, which build G2P frameworks from a dictionary, we use whole phrases, which allows us to capture various language properties, e.g. cross-word assimilation, without the need for any special care or topology adjustments. The evaluation is carried out on three different languages: English, Czech and Russian. Each requires dealing with specific properties, stressing the proposed framework in various ways. The very first results show promising performance of the proposed framework, dealing with all the phenomena specific to the tested languages. Thus, we consider the framework to be language-independent for a wide range of languages.
In this paper we adopt several anomaly detection methods to detect annotation errors in single-speaker read-speech corpora used for text-to-speech (TTS) synthesis. Correctly annotated words are considered as normal examples on which the detection methods are trained. Misannotated words are then taken as anomalous examples which do not conform to the normal patterns of the trained detection models. Word-level feature sets including basic features derived from forced alignment, and various acoustic, spectral, phonetic, and positional features were examined. Dimensionality reduction techniques were also applied to reduce the number of features. The first results, with an F1 score of almost 89%, show that anomaly detection could help in detecting annotation errors in read-speech corpora for TTS synthesis.
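The "train on normal, flag what doesn't fit" framework can be sketched with a diagonal-Gaussian density model (one of the simplest anomaly detectors; this is an illustration, not the authors' implementation, and the feature vectors are assumed to be the word-level features described above):

```python
import numpy as np

class GaussianAnomalyDetector:
    # Minimal sketch: fit a diagonal Gaussian to feature vectors of
    # correctly annotated words; words whose log-density falls below a
    # threshold are flagged as (possibly) misannotated.
    def fit(self, X):
        X = np.asarray(X, dtype=float)
        self.mu = X.mean(axis=0)
        self.var = X.var(axis=0) + 1e-8  # avoid division by zero
        return self

    def score(self, X):
        # Log-density under the fitted diagonal Gaussian (higher = more normal).
        X = np.asarray(X, dtype=float)
        return -0.5 * (((X - self.mu) ** 2 / self.var)
                       + np.log(2 * np.pi * self.var)).sum(axis=1)

    def predict(self, X, threshold):
        return self.score(X) < threshold  # True = anomalous
```

In practice the threshold would be tuned on a held-out set to trade precision against recall.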
Lecture Notes in Computer Science, 2014
In common unit selection implementations, F0 continuity is measured as one of the concatenation cost features, with the expectation that a smooth unit transition (regarding speech melody) is ensured when the difference in F0 is low enough. This measure generally uses a static F0 value computed at the unit boundary. In the present paper we show, however, that the use of static F0 values is not enough for smooth concatenation of speech units, and that the dynamic nature of the F0 contour must be taken into account. Two schemes of dynamic F0 handling are presented, and speech generated by both schemes is compared by means of listening tests on specially selected phrases which are known to carry unnatural artefacts. Advantages and disadvantages of the individual schemes are also discussed.
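The static-versus-dynamic distinction can be made concrete with a small sketch of an F0 concatenation sub-cost (the weights, the slope estimate from adjacent frames, and the combination are assumptions for illustration, not the schemes evaluated in the paper):

```python
def f0_join_cost(f0_left, f0_right, w_static=1.0, w_delta=1.0):
    # Illustrative sub-cost: combine the static F0 difference at the unit
    # boundary with the difference of local F0 slopes (a "dynamic" term
    # in the spirit of the paper). Inputs are per-frame F0 contours of
    # the left unit's end and the right unit's start.
    static = abs(f0_left[-1] - f0_right[0])
    delta_l = f0_left[-1] - f0_left[-2]    # slope at end of left unit
    delta_r = f0_right[1] - f0_right[0]    # slope at start of right unit
    dynamic = abs(delta_l - delta_r)
    return w_static * static + w_delta * dynamic
```

Note how two units can have identical boundary F0 (zero static cost) yet opposing slopes, which the dynamic term penalizes.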
Language Resources and Evaluation, May 26, 2004
This paper presents an attempt to design listening tests for the evaluation of Czech synthetic speech. The design is based on standardized and widely used listening tests for English; therefore, we can benefit from the advantages provided by standards. Bearing the phenomena of the Czech language in mind, we filled the standard frameworks of several listening tests, especially the MRT (Modified Rhyme Test) and the SUS (Semantically Unpredictable Sentences) test; the Czech National Corpus was used for this purpose. The designed tests were immediately used in real tests in which 88 people took part, a procedure which proved correct. This was the first attempt to design Czech listening tests according to the given standard frameworks, and it was successful.
We investigate the problem of automatic detection of annotation errors in single-speaker read-speech corpora used for text-to-speech (TTS) synthesis. Various word-level feature sets were used, and the performance of several detection methods based on support vector machines, extremely randomized trees and k-nearest neighbors, as well as the performance of novelty and outlier detection, is evaluated. We show that both word- and utterance-level annotation error detection perform very well, with both high precision and recall scores and with an F1 measure of almost 90% and 97%, respectively.
Speech Communication, Apr 1, 2011
A large number of methods for identifying glottal closure instants (GCIs) in voiced speech have been proposed in recent years. In this paper, we propose to take advantage of both glottal and speech signals in order to increase the accuracy of GCI detection. All aspects of this particular issue, from determining speech polarity to handling the delay between the glottal signal and the corresponding speech signal, are addressed. A robust multi-phase algorithm (MPA), which combines different methods applied to both signals in a unique way, is presented. Within the process, special attention is paid to the determination of speech waveform polarity, as it was found to considerably influence the performance of the detection algorithms. Another feature of the proposed method is that every detected GCI is given a confidence score, which allows potentially inaccurate GCI subsequences to be located. The performance of the proposed algorithm was tested and compared with other freely available GCI detection algorithms. The MPA algorithm was found to be more robust in terms of detection accuracy over various sets of sentences, languages and phone classes. Finally, some pitfalls of GCI detection are discussed.
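One simple way to assign a confidence score in the multi-method spirit of the MPA (a hypothetical sketch, not the published scoring): treat a detected GCI as more trustworthy the more independent detection methods place an instant close to it.

```python
def gci_confidence(gci, detections, tol=0.0005):
    # Hypothetical confidence score: the fraction of independent methods
    # that detected an instant within `tol` seconds of the candidate GCI.
    # `detections` is a list of GCI lists (in seconds), one per method.
    agree = sum(any(abs(gci - t) <= tol for t in method) for method in detections)
    return agree / len(detections)
```

Runs of low-confidence GCIs could then mark subsequences worth re-checking, matching the use described in the abstract.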
Lecture Notes in Computer Science, 2016
Current unit selection speech synthesis systems are capable of producing high-quality speech at the expense of enormous computational and storage requirements. In this paper, an analysis of an existing large speech corpus employed for unit-selection-based synthesis of Czech speech is performed. Subsequently, a procedure for excluding some amount of utterances from the source speech corpus is proposed. The procedure is based on statistics of the utilisation of all utterances during text-to-speech synthesis of a large portion of texts. The exclusion of whole utterances was preferred over the exclusion of particular instances of speech units in order to preserve the main feature of the unit selection framework: to select sequences of contiguous speech units that are as long as possible. After the exclusion, the footprint of the system was reduced by approximately 42%. The resulting synthetic speech was then judged by means of 5-scale CCR listening tests and evaluated on average as only "slightly worse" than speech generated by the baseline (i.e. not reduced) system.
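The usage-statistics idea reduces to counting how often each utterance contributes units when synthesizing a large body of text, then keeping the most-utilised utterances (a sketch under assumed names; the keep ratio here merely mirrors the roughly 42% footprint reduction reported, not the paper's actual selection rule):

```python
from collections import Counter

def select_utterances_to_keep(usage_log, keep_ratio=0.58):
    # `usage_log` is assumed to hold one utterance id per speech unit
    # selected during large-scale synthesis. Rank utterances by how often
    # they contributed units and keep the top `keep_ratio` fraction.
    counts = Counter(usage_log)
    ranked = [utt for utt, _ in counts.most_common()]
    n_keep = max(1, round(len(ranked) * keep_ratio))
    return set(ranked[:n_keep])
```

Excluding whole low-usage utterances, rather than individual units, keeps long contiguous unit sequences intact, as the abstract emphasizes.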
Lecture Notes in Computer Science, 2016
Anomaly detection techniques were shown to help in detecting word-level annotation errors in read-speech corpora for text-to-speech synthesis. In this framework, correctly annotated words are considered as normal examples on which the detection methods are trained. Misannotated words are then taken as anomalous examples which do not conform to the normal patterns of the trained detection models. As it can be hard to collect a sufficient number of examples to train and optimize an anomaly detector, in this paper we investigate the influence of the number of anomalous and normal examples on the detection accuracy of several anomaly detection models: Gaussian-distribution-based models, one-class support vector machines, and a Grubbs' test based model. Our experiments show that the number of examples can be significantly reduced without a large drop in detection accuracy.
In this paper, we continue to investigate the use of machine learning for the automatic detection of glottal closure instants (GCIs) from raw speech. We compare several deep one-dimensional convolutional neural network architectures on the same data and show that the InceptionV3 model yields the best results on the test set. On publicly available databases, the proposed 1D InceptionV3 outperforms XGBoost, a non-deep machine learning model, as well as other traditional GCI detection algorithms.
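Detecting GCIs "from raw speech" implies feeding the network waveform samples directly. A minimal sketch of the input preparation (window and hop sizes are illustrative, e.g. 25 ms / 5 ms at 16 kHz; the actual framing in the paper may differ):

```python
import numpy as np

def frame_raw_speech(x, win=400, hop=80):
    # Slice the raw waveform into overlapping analysis windows that a
    # 1D CNN could classify, e.g. as containing a GCI near the centre
    # of the window or not. Returns an array of shape (n_frames, win).
    x = np.asarray(x, dtype=float)
    n = 1 + max(0, (len(x) - win) // hop)
    return np.stack([x[i * hop: i * hop + win] for i in range(n)])
```

Each row would then be one training example for the convolutional classifier.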
Computer Speech & Language, Nov 1, 2017