A flexible, scalable finite-state transducer architecture for corpus-based concatenative speech synthesis

Time and space-efficient architecture for a corpus-based text-to-speech synthesis system

Speech Communication, 2007

This paper proposes a time- and space-efficient architecture for a text-to-speech (TTS) synthesis system. The architecture suits unlimited-domain applications that require multilingual or polyglot functionality. Integrating a queuing mechanism, heterogeneous graphs and finite-state machines yields a powerful, reliable and easily maintainable architecture for the TTS system. A flexible, language-independent framework integrates all the algorithms used within the TTS system. Heterogeneous relation graphs are used for linguistic information representation and feature construction. Finite-state machines are used for time- and space-efficient representation of language resources, for time- and space-efficient lookup, and for separating language-dependent resources from the language-independent TTS engine. The queuing mechanism consists of several deque (double-ended queue) data structures and activates every TTS engine module that has to process the input text. All modules use the same data structure for gathering linguistic information about the input text. All input and output formats are compatible, and the structure is modular, interchangeable, easily maintainable and object-oriented. The proposed architecture was successfully used to implement PLATTOS, a Slovenian corpus-based TTS system, as presented in this paper.
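The queuing mechanism described in this abstract can be pictured with a small sketch. It assumes one deque per processing stage and a plain dictionary as the shared data structure; the module names (`tokenize`, `tag_lengths`) and the sample inputs are hypothetical, and nothing here reflects the actual PLATTOS implementation.

```python
from collections import deque

def tokenize(utt):
    # Hypothetical module: split the raw text into tokens.
    utt["tokens"] = utt["text"].split()

def tag_lengths(utt):
    # Hypothetical module: attach a toy feature (token length) to each token.
    utt["features"] = [{"token": t, "length": len(t)} for t in utt["tokens"]]

def run_pipeline(texts, modules):
    # One deque in front of each module plus one for finished utterances.
    # Each module pops from its own queue, updates the shared structure,
    # and pushes the utterance onto the next module's queue.
    queues = [deque() for _ in range(len(modules) + 1)]
    for text in texts:
        queues[0].append({"text": text})
    for i, module in enumerate(modules):
        while queues[i]:
            utt = queues[i].popleft()
            module(utt)
            queues[i + 1].append(utt)
    return list(queues[-1])

results = run_pipeline(["dober dan", "hvala"], [tokenize, tag_lengths])
```

Because every module reads and writes the same structure, modules can be reordered or swapped without changing their input and output formats.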

A Flexible Rule Compiler for Speech Synthesis

2004

We present a flexible rule compiler developed for a text-to-speech (TTS) system. The compiler converts a set of rules into a finite-state transducer (FST). The input and output of the FST are subject to parameterization, so that the system can be applied to strings and sequences of feature structures. The resulting transducer is guaranteed to realize a function (as opposed to a relation), and therefore can be implemented as a deterministic device (either a deterministic FST or a bimachine).

1 Motivation

Implementations of TTS systems are often based on operations transforming one sequence of symbols or objects into another. Starting from the input string, the system creates a sequence of tokens which are subject to part-of-speech tagging, homograph disambiguation rules, lexical lookup and grapheme-to-phoneme conversion. The resulting phonetic transcriptions are also transformed by syllabification rules, post-lexical reductions, etc. The character of the above transformations suggests finite-state transducers (FSTs) as a modelling framework [Sproat, 1996; Mohri, 1997]. However, this is not always straightforward, for two reasons. Firstly, the transformations are more often expressed by rules than encoded directly in finite-state networks. To overcome this difficulty, we need an adequate compiler converting the rules into an FST. Secondly, finite-state machines require a finite alphabet of symbols, while it is often more adequate to encode linguistic information using structured representations (e.g. feature structures), the inventory of which might be potentially infinite. Thus, the compilation method must be able to reduce the infinite set of feature structures to a finite FST input alphabet. In this paper, we show how these two problems have been solved in rVoice, a speech synthesis system developed at Rhetorical Systems.
2 Definitions and Notation

A deterministic finite-state automaton (acceptor, DFSA) over a finite alphabet Σ is a quintuple A = (Σ, Q, q₀, δ, F) such that: Q is a finite set of states, and q₀ ∈ Q is the initial state of A; δ : Q × Σ → Q is the transition function of A; F ⊂ Q is a non-empty set of final states.
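The quintuple definition above maps directly onto a small data structure. The following is a minimal sketch, not code from rVoice: the class name `DFSA`, its field names and the example automaton (strings over {a, b} that end in b) are all assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass
class DFSA:
    alphabet: set   # Σ
    states: set     # Q
    initial: str    # q₀
    delta: dict     # δ, stored as {(state, symbol): next_state}
    finals: set     # F

    def accepts(self, word):
        # Follow δ one symbol at a time, starting from q₀.
        state = self.initial
        for symbol in word:
            if symbol not in self.alphabet:
                return False  # symbol outside Σ: reject
            state = self.delta[(state, symbol)]
        return state in self.finals

# Hypothetical example: accept strings over {a, b} that end in "b".
ends_in_b = DFSA(
    alphabet={"a", "b"},
    states={"q0", "q1"},
    initial="q0",
    delta={
        ("q0", "a"): "q0", ("q0", "b"): "q1",
        ("q1", "a"): "q0", ("q1", "b"): "q1",
    },
    finals={"q1"},
)
```

Since δ is a total function on Q × Σ, the dictionary holds one entry for every state and symbol pair, so a lookup for a symbol in Σ cannot fail.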

An Efficient Unit-Selection Method for Concatenative Text-to-Speech Synthesis Systems

Journal of Computing and Information Technology, 2007

This paper presents a method for selecting speech units for polyphone concatenative speech synthesis, in which simplifying the search for paths in a graph speeds up unit selection with minimal effect on speech quality. The speech units selected are still optimal; only the costs of merging the units, on which the selection is based, are determined less accurately. Due to its low processing power and memory footprint requirements, the method is suitable for use in embedded speech synthesizers.
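Unit selection is commonly framed as a cheapest-path search through a lattice of candidate units, where each candidate carries a target cost and each pair of adjacent candidates a join (merging) cost. The sketch below shows that generic dynamic-programming search only; it is not the simplified procedure the paper proposes, and the units, pitch values and cost functions are hypothetical.

```python
def select_units(candidates, target_cost, join_cost):
    # candidates[i] lists the candidate units for target position i.
    # best[i][j] = (cost of the cheapest path ending in candidates[i][j],
    #               index of its predecessor in candidates[i - 1])
    best = [[(target_cost(0, u), None) for u in candidates[0]]]
    for i in range(1, len(candidates)):
        row = []
        for u in candidates[i]:
            cost, back = min(
                (best[i - 1][k][0] + join_cost(p, u), k)
                for k, p in enumerate(candidates[i - 1])
            )
            row.append((cost + target_cost(i, u), back))
        best.append(row)
    # Pick the cheapest final candidate and follow the backpointers.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    total = best[-1][j][0]
    path = []
    for i in range(len(candidates) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    return path[::-1], total

# Hypothetical data: each unit is (name, pitch in Hz).
targets = [100, 110, 120]
candidates = [
    [("a1", 100), ("a2", 130)],
    [("b1", 90), ("b2", 112)],
    [("c1", 121), ("c2", 150)],
]
path, total = select_units(
    candidates,
    target_cost=lambda i, u: abs(u[1] - targets[i]),
    join_cost=lambda p, u: abs(p[1] - u[1]),
)
```

With these toy costs the search returns the units a1, b2, c1 at a total cost of 24. The work grows with the square of the number of candidates per position, which is the part a simplified path search would aim to cut down.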

Joint prosody prediction and unit selection for concatenative speech synthesis

2001

In this paper we describe how prosody prediction can be efficiently integrated with the unit selection process in a concatenative speech synthesizer under a weighted finite-state transducer (WFST) architecture. WFSTs representing prosody prediction and unit selection can be composed during synthesis, thus effectively expanding the space of possible prosodic targets. We implemented a symbolic prosody prediction module and a unit selection database as the synthesis components of a travel planning system. Results of perceptual experiments show that by combining the steps of prosody prediction and unit selection we are able to achieve improved naturalness of synthetic speech compared to the sequential implementation.
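The effect of composing the two transducers can be illustrated with a toy example. This is a sketch under simplifying assumptions: epsilon-free transducers, tropical-semiring weights (costs add along a path, the cheapest path wins), and hypothetical prosodic tags, unit names and weights. It does not reproduce the system described in the abstract.

```python
import heapq

def compose(t1, t2):
    # Epsilon-free composition: match t1's output label with t2's input label.
    arcs = []
    for (p, a, b, w1, p_next) in t1["arcs"]:
        for (q, b2, c, w2, q_next) in t2["arcs"]:
            if b == b2:
                arcs.append(((p, q), a, c, w1 + w2, (p_next, q_next)))
    return {
        "start": (t1["start"], t2["start"]),
        "finals": {(f1, f2) for f1 in t1["finals"] for f2 in t2["finals"]},
        "arcs": arcs,
    }

def shortest_path(t):
    # Dijkstra search; returns (total cost, output labels) or None.
    heap = [(0.0, 0, t["start"], ())]
    done = set()
    counter = 1  # tie-breaker so states and outputs are never compared
    while heap:
        cost, _, state, outputs = heapq.heappop(heap)
        if state in done:
            continue
        done.add(state)
        if state in t["finals"]:
            return cost, list(outputs)
        for (src, _inp, out, w, dst) in t["arcs"]:
            if src == state and dst not in done:
                heapq.heappush(heap, (cost + w, counter, dst, outputs + (out,)))
                counter += 1
    return None

# Arcs are (source, input, output, weight, destination).
prosody = {"start": 0, "finals": {2}, "arcs": [
    (0, "hello", "H*", 0.5, 1), (0, "hello", "L*", 1.0, 1),
    (1, "world", "L%", 0.25, 2), (1, "world", "H%", 0.5, 2),
]}
units = {"start": 0, "finals": {2}, "arcs": [
    (0, "H*", "u17", 2.0, 1), (0, "L*", "u03", 0.5, 1),
    (1, "L%", "u42", 1.0, 2), (1, "H%", "u08", 0.25, 2),
]}
cost, chosen = shortest_path(compose(prosody, units))
```

Here the joint search picks units u03 and u08 at a total cost of 2.25. Fixing the cheapest prosody first (H*, L%) and only then selecting units would cost 3.75, which is the kind of gap that composing the two steps is meant to close.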

Reducing the footprint of the IBM trainable speech synthesis system

… on Spoken Language …, 2002

This paper presents a novel approach for concatenative speech synthesis. This approach enables reduction of the dataset size of a concatenative text-to-speech system, namely the IBM trainable speech synthesis system, by more than an order of magnitude. A spectral ...

The NTNU Concatenative Speech Synthesizer

2010

This paper describes NTNU’s entry for the Blizzard Challenge 2010. Our system is a conceptually simple variation of an HMM-based unit selection system, which uses diphones as the basic unit and employs a combined selection of units and their join points. The evaluation results of the Blizzard Challenge 2010 show that the system performs well when compared with the other systems.

The Architecture of the Festival Speech Synthesis System

1998

We describe a new formalism for storing linguistic data in a text-to-speech system. Linguistic entities such as words and phones are stored as feature structures in a general object called a linguistic item. Items are configurable at run time and, via the feature structure, can contain arbitrary information. Linguistic relations are used to store the relationship between items of the same linguistic type. Relations can take any graph structure but are commonly trees or lists. Utterance structures contain all the items and relations contained in a single utterance. We first describe the design goals when building a synthesis architecture, and then describe some problems with previous architectures. We then discuss our new formalism in general along with the implementation details and consequences of our approach.
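The item, relation and utterance formalism can be sketched in a few lines. The class and method names below (`Item`, `Relation`, `Utterance`, `append`, `daughters`) and the relation names are assumptions chosen for illustration; this is not Festival's actual API.

```python
class Item:
    # A linguistic item: an open-ended feature structure, set at run time.
    def __init__(self, **features):
        self.features = dict(features)

class Relation:
    # A named structure over items: top-level items form a list,
    # and optional parent/child links turn it into a tree.
    def __init__(self, name):
        self.name = name
        self.roots = []
        self.children = {}  # id(parent item) -> list of child items

    def append(self, item, parent=None):
        if parent is None:
            self.roots.append(item)
        else:
            self.children.setdefault(id(parent), []).append(item)
        return item

    def daughters(self, item):
        return self.children.get(id(item), [])

class Utterance:
    # Holds every relation (and through them every item) of one utterance.
    def __init__(self):
        self.relations = {}

    def relation(self, name):
        return self.relations.setdefault(name, Relation(name))

utt = Utterance()
word = utt.relation("Word").append(Item(name="hello", pos="uh"))
utt.relation("SylStructure").append(word)
for phone in ["h", "ax", "l", "ow"]:
    seg = utt.relation("Segment").append(Item(name=phone))
    # The same item object also hangs under the word in a tree relation.
    utt.relation("SylStructure").append(seg, parent=word)
```

The point of the design is that one item can take part in several relations at once: each phone here sits in a flat list and in a tree below its word, so any module can reach it through whichever view it needs.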

Joint Prosody Prediction and Unit Selection for Concatenative Speech Synthesis Using Large Database

In this paper we describe how prosody prediction can be efficiently integrated with the unit selection process in a concatenative speech synthesizer under a weighted finite-state transducer (WFST) architecture. WFSTs representing prosody prediction and unit selection can be composed during synthesis, thus effectively expanding the space of possible prosodic targets. We implemented a symbolic prosody prediction module and a unit selection database as the synthesis components of a travel planning system. Results of perceptual experiments show that by combining the steps of prosody prediction and unit selection we are able to achieve improved naturalness of synthetic speech compared to the sequential implementation.

What next? Continuation in Real-time corpus-based concatenative synthesis

2008

We propose an extension to real-time corpus-based concatenative synthesis that predicts the best sound unit to follow an arbitrary sequence of units depending on context. This novel method is well suited to interactive applications because it does not need a preliminary analysis or training phase. We experiment with different modes of interaction and present results with a quantitative evaluation of the influence of the new method on corpora of drum loops, voice, and environmental sounds.