Matthew Aylett | University of Edinburgh

Papers by Matthew Aylett

Research paper thumbnail of The CereVoice Characterful Speech Synthesiser SDK

CereProc® Ltd. have recently released a beta version of a commercial unit selection synthesiser featuring XML control of speech style. The system is freely available for academic use and allows fine control of the rendered speech as well as full timings to interface with avatars and other animation.
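
To give a flavour of what "XML control of speech style" can look like in practice, here is a small sketch that assembles such a markup fragment with Python's standard library. The tag and attribute names (`speak`, `style`, `name`) are illustrative assumptions, not the documented CereVoice schema:

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: tag and attribute names are invented for illustration,
# not taken from the actual CereVoice API.
speak = ET.Element("speak")
style = ET.SubElement(speak, "style", {"name": "calm"})  # assumed style-selection tag
style.text = "Hello, and welcome."

print(ET.tostring(speak, encoding="unicode"))
# -> <speak><style name="calm">Hello, and welcome.</style></speak>
```

The point is only that style control rides along in the markup with the text itself, so a client can switch speaking style per span rather than per utterance.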

Research paper thumbnail of The Cerevoice Blizzard Entry 2006: A Prototype Small Database Unit Selection Engine

CereVoice® is a unit selection speech synthesis system produced by CereProc Ltd. The system was used to build small and large unit selection databases from the data supplied by the Blizzard Challenge 2006. The large database system served as a baseline while two experimental approaches for improving the quality of the small database system were explored: 1) synthetically generating diphones offline from sections of existing diphones and then using them in synthesis, a process we term bulking; and 2) applying limited manual intervention based on negative feedback, a process we term second-pass synthesis. Both techniques resulted in the small database system maintaining the quality of the larger system. We conclude that there is much room for improving the quality of small database unit selection systems without requiring more data, and that second-pass synthesis offers a potential means of training such systems.
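
The "bulking" idea, assembling a missing diphone by splicing halves of diphones already in the database, can be sketched as follows. This is a toy reconstruction under the standard assumption that a diphone a-b spans from the midpoint of phone a to the midpoint of phone b; the function name and the list-of-samples representation are inventions for illustration, not the paper's implementation:

```python
def bulk_diphone(inventory, left, right):
    """Toy 'bulking': build a missing diphone left-right by joining the
    first half of some diphone starting with `left` (i.e. the second half
    of phone `left`) to the second half of some diphone ending in `right`
    (i.e. the first half of phone `right`).
    `inventory` maps (l, r) -> list of samples."""
    if (left, right) in inventory:
        return inventory[(left, right)]  # already present, nothing to bulk
    first = next(w[:len(w) // 2] for (l, r), w in inventory.items() if l == left)
    second = next(w[len(w) // 2:] for (l, r), w in inventory.items() if r == right)
    return first + second

# Toy inventory: "samples" are integers so the splice is easy to inspect.
inv = {("a", "t"): [1, 2, 3, 4], ("u", "b"): [5, 6, 7, 8]}
print(bulk_diphone(inv, "a", "b"))  # -> [1, 2, 7, 8]
```

A real system would of course splice at pitch marks and smooth the join, but the inventory-expansion logic is the part the abstract is describing.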

Research paper thumbnail of The Cerevoice Blizzard Entry 2007: Are Small Database Errors Worse than Compression Artifacts

In commercial systems the memory footprint of unit selection systems is often a key issue, especially for PDAs and other embedded devices. In this year's Blizzard entry CereProc® gave itself the criterion that the full database system entered would have a smaller memory footprint than either of the two smaller database entries. This was accomplished by applying Speex speech compression to the full database entry. In turn, the set of small database techniques used to improve quality in last year's entry was extended. Finally, for all systems, two quality control methods were applied to the underlying database to improve the match between the lexicon, the transcription, and the underlying data.

Research paper thumbnail of The CereProc Blizzard Entry 2009: Some dumb algorithms that don't work

Within unit selection systems there is a constant tension between data sparsity and quality, which limits the control possible in a unit selection system. The RP data used in Blizzard this year and last year is expressive and spoken in a spirited manner. Last year's entry focused on maintaining expressiveness; this year we focused on two simple algorithms to restrain and control this prosodic variation: 1) variable width valley floor pruning on duration and pitch (applied to the full database entry, EH1), and 2) bulking the data with average HTS data (applied to the small database entry, EH2). Results for both techniques were disappointing. The full database system achieved an MOS of around 2 (compared to 4 for a similar system attempting to emphasise variation in 2008), while the small database entry also achieved an MOS of 2 (compared to 3 for a similar system, albeit with a different voice, entered in 2007).

Research paper thumbnail of THE CEREVOICE SPEECH SYNTHESISER

This paper describes the CereVoice® text-to-speech system developed by CereProc Ltd, and its use for generating the test sentences for the Albayzin 2008 TTS evaluation. It also describes the procedure for building a CereVoice-compatible voice for the Albayzin 2008 evaluation using the provided database and the CereVoice VCK, a CereProc tool for fast, fully automated creation of voices.

Research paper thumbnail of Single Speaker Segmentation and Inventory Selection Using Dynamic Time Warping Self Organization and Joint Multigram Mapping

In speech synthesis the inventory of units is decided by inspection and on the basis of phonological and phonetic expertise. The ephone (or emergent phone) project at CSTR is investigating how self-organisation techniques can be applied to build an inventory from collected acoustic data together with the constraints of a synthesis lexicon. In this paper we describe a prototype inventory creation method using dynamic time warping (DTW) for acoustic clustering and a joint multigram approach for relating a series of symbols representing the speech to these emerged units. We initially examined two symbol sets: 1) a baseline of standard phones, and 2) orthographic symbols. The success of the approach is evaluated by comparing word boundaries generated by the emergent phones against those created using state-of-the-art HMM segmentation. Initial results suggest the DTW segmentation can match word boundaries with a root mean square error (RMSE) of 35 ms. Mapping units onto phones resulted in a higher RMSE of 103 ms; this error increased when multiple multigram types were added and when the default unit clustering was altered from 40 (our baseline) to 10. Orthographic matching had a higher RMSE of 125 ms. To conclude, we discuss future work that we believe can reduce this error rate to a level sufficient for the techniques to be applied to a unit selection synthesis system.
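
DTW itself is a standard algorithm even if the clustering built on top of it here is project-specific. A minimal pure-Python sketch of the alignment cost (not the paper's implementation, and using a scalar distance for brevity where real use would compare acoustic feature vectors):

```python
def dtw(a, b, dist=lambda x, y: abs(x - y)):
    """Classic dynamic time warping: cost of the cheapest monotonic
    alignment between sequences a and b."""
    INF = float("inf")
    n, m = len(a), len(b)
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = dist(a[i - 1], b[j - 1])
            # extend the best of: insertion, deletion, or diagonal match
            D[i][j] = c + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

print(dtw([0, 1, 2, 3], [0, 1, 1, 2, 3]))  # -> 0.0 (the repeated 1 aligns for free)
```

For acoustic clustering one would run this pairwise over feature sequences (e.g. MFCC frames) and feed the resulting distance matrix to a clustering algorithm.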

Research paper thumbnail of Speech synthesis without a phone inventory

In speech synthesis the unit inventory is decided using phonological and phonetic expertise. This process is resource intensive and potentially sub-optimal. In this paper we investigate how acoustic clustering, together with lexicon constraints, can be used to build a self-organised inventory. Six English speech synthesis systems were built using two frameworks, unit selection and parametric HTS, for three inventory conditions: 1) a traditional phone set, 2) a system using orthographic units, and 3) a self-organised inventory. A listening test showed a strong preference for the classic system, and for the orthographic system over the self-organised system. Results also varied by letter-to-sound complexity and database coverage. This suggests the self-organised approach failed to generalise pronunciation, as well as introducing noise above and beyond that caused by orthographic sound mismatch.

Research paper thumbnail of The Romanian speech synthesis (RSS) corpus: Building a high quality HMM-based speech synthesis system using a high sampling rate

Speech Communication, 2011

This paper first introduces a newly-recorded high quality Romanian speech corpus designed for speech synthesis, called "RSS", along with Romanian front-end text processing modules and HMM-based synthetic voices built from the corpus. All of these are now freely available for academic use in order to promote Romanian speech technology research. The RSS corpus comprises 3500 training sentences and 500 test sentences uttered by a female speaker and was recorded using multiple microphones at a 96 kHz sampling frequency in a hemi-anechoic chamber. The details of the new Romanian text processor we have developed are also given.

Research paper thumbnail of Combining Statistical Parametric Speech Synthesis and Unit-Selection for Automatic Voice Cloning

The ability to use the recorded audio of a subject's voice to produce an open-domain synthesis system has generated much interest both in academic research and in commercial speech technology. The ability to produce synthetic versions of a subject's voice has potential commercial applications, such as virtual celebrity actors, and potential clinical applications, such as offering a synthetic replacement voice in the case of a laryngectomy. Recent developments in HMM-based speech synthesis have shown it is possible to produce synthetic voices from quite small amounts of speech data. However, mimicking the depth and variation of a speaker's prosody, as well as synthesising natural voice quality, is still a challenging research problem. In contrast, unit-selection systems have shown it is possible to strongly retain the character of the voice, but only with sufficient original source material; this often runs into hours and may require significant manual checking and labelling.

Research paper thumbnail of A statistically motivated database pruning technique for unit selection synthesis

Research paper thumbnail of My voice, your prosody: sharing a speaker specific prosody model across speakers in unit selection TTS

Data sparsity is a major problem for data driven prosodic models. Being able to share prosodic data across speakers is a potential solution to this problem. This paper explores that solution by addressing two questions: 1) Does a larger, less sparse model from a different speaker produce more natural speech than a small, sparse model built from the original speaker? 2) Does a different speaker's larger model generate more unit selection errors than a small, sparse model built from the original speaker?

Research paper thumbnail of Vowel quality in spontaneous speech: what makes a good vowel

Clear speech is characterised by longer segmental durations and less target undershoot [9], which results in more extreme spectral features. This paper deals with the clarity of vowels produced in spontaneous speech in a large corpus of task-oriented dialogues. We present an ...

Research paper thumbnail of Language redundancy predicts syllabic duration and the spectral characteristics of vocalic syllable nuclei

Journal of The Acoustical Society of America, 2006

The language redundancy of a syllable, measured by its predictability given its context and inherent frequency, has been shown to have a strong inverse relationship with syllabic duration. This relationship is predicted by the hypothesis that an inverse relationship between language redundancy and the predictability given acoustic observations (the acoustic redundancy) makes speech more robust in a noisy environment (the smooth signal redundancy hypothesis). This hypothesis also predicts a similar relationship between the spectral characteristics of speech and language redundancy. However, the investigation of such a relationship is hampered by difficulties in measuring the spectral characteristics of speech within large conversational corpora, and in forming models of acoustic redundancy based on these spectral characteristics. This paper addresses these difficulties by testing the smooth signal redundancy hypothesis with a very high quality corpus collected for speech synthesis, and presents both durational and spectral data from vowel nuclei on a vowel-by-vowel basis. Results confirm the duration/language-redundancy results achieved in previous work, and show a significant relationship between language redundancy factors and the F1/F2 formants. The results vary considerably by vowel; in general, vowels show increased centralization with increased language redundancy.
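
The centralization finding lends itself to a simple operationalisation: measure each token's Euclidean distance in F1/F2 space from the speaker's vowel-space centroid, with smaller distances indicating a more centralized (reduced) token. A sketch under that assumption (this is a generic measure, not necessarily the paper's exact method, and the formant values below are toy numbers):

```python
import math

def centroid(tokens):
    """Mean (F1, F2) over a list of (f1, f2) formant measurements in Hz."""
    f1s, f2s = zip(*tokens)
    return (sum(f1s) / len(f1s), sum(f2s) / len(f2s))

def centralization(token, center):
    """Euclidean F1/F2 distance from the vowel-space centre;
    smaller values = more centralized (reduced) vowel."""
    return math.hypot(token[0] - center[0], token[1] - center[1])

# Toy tokens: a peripheral /i/-like token, an /a/-like token, and a reduced token.
tokens = [(300, 2300), (700, 1200), (500, 1500)]
c = centroid(tokens)
print(centralization((500, 1500), c) < centralization((300, 2300), c))  # -> True
```

Under the smooth signal redundancy hypothesis, this distance would be expected to shrink as language redundancy rises, mirroring the durational shortening.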

Co-authored papers by Matthew Aylett

Research paper thumbnail of Multilevel auditory displays for mobile eyes-free location-based interaction

CHI EA '14: CHI '14 Extended Abstracts on Human Factors in Computing Systems (pp. 1567–1572), 2014
