RESYNTHESIZING THE GECO SPEECH CORPUS WITH VOCALTRACTLAB

Studientexte zur Sprachkommunikation: Elektronische Sprachsignalverarbeitung 2019

Sering, K., N. Stehwien, Y. Gao, M. V. Butz, and R. H. Baayen

We are addressing the challenge of learning an inverse mapping between acoustic features and control parameters of a vocal tract simulator. As a first step, we synthesize an articulatory corpus consisting of control parameters and waveforms, using VocalTractLab (VTL; [1]) as the vocal tract simulator. The basis for the synthesis is a concatenative approach that combines VTL gestures according to a SAMPA transcription. The SAMPA transcriptions are taken from the GECO corpus [2], a spontaneous speech corpus of southern German. The presented approach uses the durations of the phones and extracted pitch contours to create gesture files for VTL. The resynthesis of the GECO corpus results in 53960 valid spliced-out word samples, totalling 6 hours and 23 minutes of synthesized speech. The synthesis quality is mediocre. We believe that the synthesized samples capture some of the variability found in natural human speech.

1 Motivation

Constructing an articulatory corpus benefits many research fields, including automatic speech recognition [3], speech synthesis [4], and acoustic-to-articulatory inversion [5], among others. Several articulatory corpora exist, such as the Wisconsin X-ray Microbeam Database (XRMB) [6], MOCHA-TIMIT [7], and MRI-TIMIT [8], which have been successfully employed in these research fields. In the present paper, we aim at constructing an articulatory corpus, together with the corresponding synthesized speech signals, by applying a vocal tract simulator to a spontaneous German speech corpus. Compared to corpora recorded with measurement hardware, this approach is neither labor-intensive nor invasive for the speakers. Moreover, unlike articulatory information obtained from recorded images or from a limited number of measurement points, it provides a rich representation of the articulation process, quantified by 30 control parameters at a temporal resolution of 10 milliseconds. These parameters can in turn be used to control articulatory synthesis.

Coming up with the control parameters for the vocal tract simulator is not an easy task. The two most prominent approaches to approximating the parameters that control a vocal tract simulator are, firstly, to give the articulators in the vocal tract simulation different targets at different points in time and to interpolate between these targets in a suitable way, or, secondly, to define a set of gestures, each of which specifies the trajectory of a subset of the articulators over a time interval. Using a gestural approach and allowing gestures to overlap requires a rule for blending gestures. We believe that both of these approaches capture some of the structure that we see in human articulation, but cannot account for the wide range of articulatory variation present in everyday natural speech. We therefore seek to replace rule-based target or gesture approaches, which compose a small number of targets or gestures in a clever concatenative way, with a model of the structure of the whole trajectory learned in a more direct, data-driven way. One approach to generating trajectories without defining targets or gestures is to find a mapping between acoustic features and control parameter trajectories.
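To make the target-based approach mentioned above concrete, the following sketch (not the authors' implementation; all values are illustrative) represents an utterance as a sequence of 30-dimensional target vectors of control parameters and fills in a trajectory on a 10 ms frame grid by piecewise-linear interpolation. Actual target-approximation models interpolate more cleverly, e.g. with critically damped dynamics, but the data layout is the same.

```python
import numpy as np

N_PARAMS = 30    # number of VTL control parameters (as stated in the paper)
FRAME_S = 0.010  # 10 ms temporal resolution (as stated in the paper)

def interpolate_targets(targets, times, total_dur):
    """Piecewise-linear interpolation between articulatory targets.

    targets   -- array of shape (n_targets, N_PARAMS)
    times     -- increasing target times in seconds, one per target
    total_dur -- utterance duration in seconds
    Returns an array of shape (n_frames, N_PARAMS).
    """
    frame_times = np.arange(0.0, total_dur, FRAME_S)
    # np.interp handles one track at a time; stack the 30 parameter tracks.
    return np.stack(
        [np.interp(frame_times, times, targets[:, k]) for k in range(N_PARAMS)],
        axis=1,
    )

# Two hypothetical targets 200 ms apart, e.g. a transition between two
# articulatory configurations; the values are illustrative, not real VTL shapes.
targets = np.zeros((2, N_PARAMS))
targets[1, :] = 1.0
traj = interpolate_targets(targets, times=np.array([0.0, 0.2]), total_dur=0.3)
print(traj.shape)  # (30, 30): 30 frames of 10 ms by 30 control parameters
```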
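The gestural approach underlying the resynthesis pipeline can be sketched in a similar spirit. The snippet below writes a minimal VTL-style gestural score from SAMPA phones with durations and a single flat pitch target. The element and attribute names follow the gestural score XML of VTL 2.x as we understand it; the exact schema, sequence types, and time constants vary between VTL versions and should be treated as assumptions, not as a definitive file format.

```python
import xml.etree.ElementTree as ET

def write_gestural_score(phones, f0_st, path):
    """phones -- list of (sampa_symbol, duration_in_seconds) pairs
    f0_st  -- pitch target in semitones for a flat f0 contour
    path   -- output gesture file path
    """
    score = ET.Element("gestural_score")

    # One gesture per phone. A real pipeline would route consonants to
    # their own sequences (lip, tongue-tip, ...) and add glottal gestures;
    # this sketch puts everything into a single sequence for brevity.
    seq = ET.SubElement(score, "gesture_sequence",
                        type="vowel-gestures", unit="")
    for sampa, dur in phones:
        ET.SubElement(seq, "gesture", value=sampa, slope="0.0",
                      duration_s=f"{dur:.6f}",
                      time_constant_s="0.015", neutral="0")

    # A single flat f0 gesture spanning the utterance; the paper instead
    # derives the f0 gestures from pitch contours extracted from GECO.
    total = sum(dur for _, dur in phones)
    f0_seq = ET.SubElement(score, "gesture_sequence",
                           type="f0-gestures", unit="st")
    ET.SubElement(f0_seq, "gesture", value=f"{f0_st:.2f}", slope="0.0",
                  duration_s=f"{total:.6f}",
                  time_constant_s="0.030", neutral="0")

    ET.ElementTree(score).write(path)

# Hypothetical example: a short German word as SAMPA phones with durations.
write_gestural_score([("o:", 0.12), ("d", 0.06), ("6", 0.10)], 32.0, "oder.ges")
```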
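Finally, the inverse mapping itself can be illustrated with a deliberately simplified, frame-wise toy model: given paired acoustic features and control parameters (simulated here), a linear least-squares map is fitted from one to the other. This only sketches the data layout such a corpus enables; the mapping the paper envisions would be nonlinear and would model whole trajectories with temporal context.

```python
import numpy as np

rng = np.random.default_rng(0)
n_frames, n_acoustic, n_params = 10_000, 20, 30

X = rng.normal(size=(n_frames, n_acoustic))  # acoustic features per frame
W_true = rng.normal(size=(n_acoustic, n_params))
Y = X @ W_true + 0.1 * rng.normal(size=(n_frames, n_params))  # control params

W, *_ = np.linalg.lstsq(X, Y, rcond=None)  # frame-wise linear inverse map
print(np.abs(W - W_true).max())            # recovery error stays small
```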