AMISCO: The Austrian German Multi-Sensor Corpus
Related papers
EURASIP Journal on Advances in Signal Processing, 2002
Strides in computer technology and the search for deeper, more powerful techniques in signal processing have brought multimodal research to the forefront in recent years. Audio-visual speech processing has become an important part of this research because it holds great potential for overcoming certain problems of traditional audio-only methods. Difficulties due to background noise and multiple speakers in an application environment are significantly reduced by the additional information provided by visual features. This paper presents information on a new audio-visual database, a feature study on moving speakers, and baseline results for the whole speaker group.
Computer Speech & …, 2011
We present an overview of the data collection and transcription efforts for the COnversational Speech In Noisy Environments (COSINE) corpus. The corpus is a set of multi-party conversations recorded in real-world environments, with background noise, that can be used to train noise-robust speech recognition systems or develop speech de-noising algorithms. We explain the motivation for creating such a corpus and describe the resulting audio recordings and transcriptions that comprise it. These high-quality recordings were captured in situ on a custom wearable recording system, whose design and construction are also described. Seven channels of audio are captured on separate synchronized channels: a 4-channel far-field microphone array, a close-talking microphone, a monophonic far-field microphone, and a throat microphone. This corpus thus creates many possibilities for speech algorithm research. Training on audio whose noise conditions are matched to the audio being recognized is likely to give an upper bound on the performance of model compensation schemes (acoustic model adaptation) [1]. The use of training audio that exhibits the Lombard effect has also been shown to improve the performance of speech recognition systems.
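To make the channel layout concrete, here is a minimal sketch of splitting a COSINE-style seven-channel recording into its component microphone streams. The file name and the channel ordering are assumptions made for illustration, not the documented COSINE layout.

```python
# Sketch: splitting a hypothetical 7-channel COSINE-style recording into its
# component microphone streams. The channel ordering below is an assumption
# for illustration, not the documented COSINE layout.
import soundfile as sf

def split_channels(path):
    audio, sr = sf.read(path)            # shape: (num_frames, num_channels)
    assert audio.ndim == 2 and audio.shape[1] == 7, "expected 7 synchronized channels"
    return {
        "array": audio[:, 0:4],          # 4-channel far-field microphone array (assumed order)
        "close_talk": audio[:, 4],       # close-talking microphone
        "far_field_mono": audio[:, 5],   # monophonic far-field microphone
        "throat": audio[:, 6],           # throat microphone
    }, sr

if __name__ == "__main__":
    channels, sr = split_channels("cosine_session.wav")  # hypothetical file name
    print({name: ch.shape for name, ch in channels.items()}, sr)
```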
A basic requirement for improvements in distant speech recognition is the availability of a suitable corpus of recorded utterances. With microphone arrays now available off-the-shelf as part of the Microsoft Kinect, a common recording device for such a corpus is widespread. In this paper, we introduce KiSRecord, an open-source recording tool that can be used to ease such data collection. We thereby provide the first steps towards a community-sourcing effort for a speech corpus recorded with de facto standard microphone arrays.
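The following sketch shows the kind of multichannel capture such a tool performs, using the sounddevice library. It is a generic illustration, not KiSRecord's actual implementation; the sample rate, channel count, duration, and output file name are assumptions for a local setup.

```python
# Sketch: recording a short multichannel clip from a Kinect-style microphone
# array with the sounddevice library. Generic illustration only; device index
# and channel count are assumptions for the local setup.
import sounddevice as sd
import soundfile as sf

SAMPLE_RATE = 16000
CHANNELS = 4          # the Kinect exposes a 4-microphone array
DURATION_S = 5
DEVICE = None         # None = default input device; set an index to pick the array

def record_clip(path):
    frames = int(DURATION_S * SAMPLE_RATE)
    audio = sd.rec(frames, samplerate=SAMPLE_RATE, channels=CHANNELS, device=DEVICE)
    sd.wait()                              # block until the recording is complete
    sf.write(path, audio, SAMPLE_RATE)     # store all channels in one WAV file

if __name__ == "__main__":
    record_clip("array_utterance.wav")     # hypothetical output file
```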
In this paper the FAU IISAH corpus and its recording conditions are described: a new speech database consisting of human-machine and human-human interaction recordings. Besides close-talking microphones, used for the best possible audio quality of the recorded speech, far-distance microphones were used to capture the interaction and communication. The recordings took place during a Wizard-of-Oz experiment in the intelligent, senior-adapted house (ISA-House), a living room with a speech-controlled home assistance system for elderly people, based on a dialogue system that is able to process spontaneous speech. During the studies in the ISA-House, more than eight hours of interaction data were recorded, including 3 hours and 27 minutes of spontaneous speech. The data were annotated with respect to human-human (off-talk) and human-machine (on-talk) interaction.
Resynthesizing the GECO Speech Corpus with VocalTractLab
Studientexte zur Sprachkommunikation: Elektronische Sprachsignalverarbeitung 2019, 2019
Sering, K., N. Stehwien, Y. Gao, M. V. Butz, and R. H. Baayen

We are addressing the challenge of learning an inverse mapping between acoustic features and control parameters of a vocal tract simulator. As a first step, we synthesize an articulatory corpus consisting of control parameters and waveforms using VocalTractLab (VTL; [1]) as the vocal tract simulator. The basis for the synthesis is a concatenative approach that combines gestures of VTL according to a SAMPA transcription. SAMPA transcriptions are taken from the GECO corpus [2], a spontaneous speech corpus of southern German. The presented approach uses the durations of the phones and extracted pitch contours to create gesture files for the VTL. The resynthesis of the GECO corpus results in 53,960 valid spliced-out word samples, totalling 6 hours and 23 minutes of synthesized speech. The synthesis quality is mediocre. We believe that the synthesized samples resemble some of the natural variability found in natural human speech.

1 Motivation

Constructing an articulatory corpus benefits many research fields, including automatic speech recognition [3], speech synthesis [4], and acoustic-to-articulatory inversion [5]. Several articulatory corpora exist, such as the Wisconsin X-ray microbeam database (XRMB) [6], MOCHA-TIMIT [7], and MRI-TIMIT [8], which have been successfully employed in these research fields. In the present paper, we aim at constructing an articulatory corpus, together with the corresponding synthesized speech signals, using a vocal tract simulator on top of a spontaneous German speech corpus. Compared to hardware-based recorded corpora, this approach is neither labor intensive nor invasive to speakers. Moreover, unlike articulatory information from recorded images or a limited set of measurement points, it provides a rich representation of the articulation process, quantified by 30 control parameters at a resolution of 10 milliseconds. These parameters can in turn be used to control articulatory synthesis. Coming up with the control parameters for the vocal tract simulator is not an easy task. The two most prominent approaches to approximating the parameters that control a vocal tract simulator are, first, to give the articulators in the vocal tract simulation different targets at different points in time and interpolate between these targets cleverly, or, second, to define a set of gestures that specify the trajectory of a subset of the articulators for a time interval. Using a gestural approach and allowing for gesture overlaps demands a rule for mixing gestures. We believe that both of these approaches capture some of the structure that we see in human articulations but cannot account for the wide range of different articulations present in everyday natural speech. We therefore seek to replace a rule-based target or gesture approach, which composes a small number of targets or gestures in a smart concatenative way, by modeling the structure of the whole trajectory in a more direct, data-driven way. One approach to generating trajectories without defining targets or gestures is to find a mapping between acoustic features and control parameters.
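As a rough illustration of the concatenative step described above, the sketch below derives a simplified gestural score from phone durations and a pitch contour. The gesture representation is a toy data structure invented for this example; it is not the actual VocalTractLab gesture-file format.

```python
# Sketch: deriving a simplified gestural score from SAMPA phone durations and a
# pitch contour, in the spirit of the concatenative approach described above.
# The Gesture structure is illustrative only, not the VTL gesture XML schema.
from dataclasses import dataclass
from typing import List, Tuple
import numpy as np

@dataclass
class Gesture:
    phone: str          # SAMPA symbol
    start_s: float
    end_s: float
    mean_f0_hz: float   # target pitch over the gesture's interval

def build_gesture_score(phones: List[Tuple[str, float]],
                        f0_track: np.ndarray,
                        f0_step_s: float = 0.01) -> List[Gesture]:
    """phones: (SAMPA symbol, duration in seconds); f0_track: F0 sampled every f0_step_s."""
    score, t = [], 0.0
    for symbol, duration in phones:
        lo = int(t / f0_step_s)
        hi = max(lo + 1, int((t + duration) / f0_step_s))
        segment = f0_track[lo:hi]
        voiced = segment[segment > 0]                     # ignore unvoiced frames
        f0 = float(voiced.mean()) if voiced.size else 0.0
        score.append(Gesture(symbol, t, t + duration, f0))
        t += duration
    return score

# Hypothetical example: the word "ja" with a flat 120 Hz pitch contour.
if __name__ == "__main__":
    f0 = np.full(50, 120.0)                               # 0.5 s of F0 at 10 ms steps
    for g in build_gesture_score([("j", 0.08), ("a:", 0.22)], f0):
        print(g)
```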
GestSync: Determining who is speaking without a talking head
arXiv (Cornell University), 2023
In this paper we introduce a new synchronisation task, Gesture-Sync: determining whether a person's gestures are correlated with their speech or not. In comparison to Lip-Sync, Gesture-Sync is far more challenging, as there is a far looser relationship between the voice and body movement than there is between voice and lip motion. We introduce a dual-encoder model for this task and compare a number of input representations, including RGB frames, keypoint images, and keypoint vectors, assessing their performance and advantages. We show that the model can be trained using self-supervised learning alone, and evaluate its performance on the LRS3 dataset. Finally, we demonstrate applications of Gesture-Sync for audio-visual synchronisation and for determining who is the speaker in a crowd, without seeing their faces. The code, datasets and pre-trained models can be found at: https://www.robots.ox.ac.uk/~vgg/research/gestsync.

Figure 1: Who is speaking in these scenes? Our model, dubbed GestSync, learns to identify whether a person's gestures and speech are "in-sync". The learned embeddings from our model are used to determine "who is speaking" in the crowd, without looking at their faces. Please refer to the demo video for examples.
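For intuition, here is a minimal dual-encoder sketch for a Gesture-Sync-style task, assuming pre-extracted per-frame keypoint vectors and log-mel audio features. The layer sizes, feature dimensions, and the InfoNCE-style objective are illustrative assumptions, not the architecture or loss reported in the paper.

```python
# Sketch: a minimal dual-encoder for a Gesture-Sync-style task. Matched
# (gesture, audio) pairs are pulled together by a contrastive objective;
# all dimensions are assumed for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    def __init__(self, in_dim, embed_dim=256):
        super().__init__()
        # Temporal convolutions over the sequence, then mean-pool to one embedding.
        self.net = nn.Sequential(
            nn.Conv1d(in_dim, 256, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(256, embed_dim, kernel_size=5, padding=2),
        )

    def forward(self, x):                     # x: (batch, time, in_dim)
        h = self.net(x.transpose(1, 2))       # -> (batch, embed_dim, time)
        return F.normalize(h.mean(dim=2), dim=-1)

def sync_loss(gesture_emb, audio_emb, temperature=0.07):
    # Matched (gesture, audio) pairs lie on the diagonal of the similarity matrix.
    logits = gesture_emb @ audio_emb.t() / temperature
    targets = torch.arange(logits.size(0))
    return F.cross_entropy(logits, targets)

if __name__ == "__main__":
    gesture_enc = Encoder(in_dim=34)          # e.g. 17 two-dimensional body keypoints (assumed)
    audio_enc = Encoder(in_dim=80)            # e.g. 80 log-mel bands (assumed)
    gestures = torch.randn(8, 100, 34)        # batch of 8 clips, 100 frames each
    audio = torch.randn(8, 100, 80)
    loss = sync_loss(gesture_enc(gestures), audio_enc(audio))
    print(loss.item())
```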
Multimodal Corpus of Speech Production: Work in Progress
The paper introduces work in progress on multimodal articulatory data collection involving multiple instrumental techniques, namely electrolaryngography (EGG), electropalatography (EPG) and electromagnetic articulography (EMA). The data are recorded from two native Estonian speakers (one male and one female); the target size of the corpus is approximately one hour of speech from both subjects. The paper introduces the instrumental systems used for data collection and the recording set-ups, gives examples of multimodal data analysis, and discusses possible uses of the corpus.
GRASS: The Graz corpus of read and spontaneous speech
This paper provides a description of the preparation, the speakers, the recordings, and the creation of the orthographic transcriptions of the first large-scale speech database for Austrian German. It contains approximately 1900 minutes of (read and spontaneous) speech produced by 38 speakers. The corpus consists of three components. First, the Conversation Speech (CS) component contains free conversations of one hour in length between friends, colleagues, couples, or family members. Second, the Commands Component (CC) contains commands and keywords which were either read or elicited by pictures. Third, the Read Speech (RS) component contains phonetically balanced sentences and digits. The speech of all components has been recorded at super-wideband quality in a soundproof recording studio with head-mounted microphones, large-diaphragm microphones, a laryngograph, and a video camera. The orthographic transcriptions, which have been created and subsequently corrected manually, contain approximately 290,000 word tokens from 15,000 different word types.
TRAP-TANDEM: data-driven extraction of temporal features from speech
2003 IEEE Workshop on Automatic Speech Recognition and Understanding (IEEE Cat. No.03EX721)
Conventional features in automatic recognition of speech describe the instantaneous shape of the short-term spectrum of speech. The TRAP-TANDEM features describe the likelihood of sub-word classes at a given time instant, derived from temporal trajectories of band-limited spectral densities in the vicinity of the given instant. The paper presents some rationale behind the data-driven TRAP-TANDEM approach, briefly describes the technique, points to relevant publications, and summarizes results achieved so far.
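The sketch below illustrates the core TRAP idea under simplified assumptions: instead of describing a single short-term spectral frame, it extracts a long temporal trajectory of each critical-band energy around the frame of interest. The band count, context length, and input file name are assumptions; the per-band classifiers and merger network of the full TRAP-TANDEM system are omitted.

```python
# Sketch of the TRAP idea: one ~1-second trajectory of log band energy per
# critical band, centered on each frame. Simplified; the band classifiers and
# merger of the full TRAP-TANDEM system are not shown.
import numpy as np
import librosa

def trap_features(wav_path, n_bands=15, context=50, hop_s=0.01):
    y, sr = librosa.load(wav_path, sr=16000)
    hop = int(hop_s * sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_bands, hop_length=hop)
    log_e = np.log(mel + 1e-10)                    # (n_bands, n_frames) band energies
    n_frames = log_e.shape[1]
    traps = []
    for t in range(context, n_frames - context):
        # One (2*context+1)-frame trajectory per band, per frame.
        window = log_e[:, t - context:t + context + 1]
        window = window - window.mean(axis=1, keepdims=True)   # remove per-band offset
        traps.append(window)
    return np.stack(traps)                         # (frames, n_bands, 2*context+1)

if __name__ == "__main__":
    feats = trap_features("utterance.wav")          # hypothetical input file
    print(feats.shape)
```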
Acoustic feature comparison for different speaking rates
Springer, Cham - International Conference on Human-Computer Interaction, LNCS, 2018
This paper investigates the effect of speaking rate variation on the task of frame classification. This task is indicative of the performance on phoneme and word recognition and is a first step towards designing voice-controlled interfaces. Different speaking rates cause different dynamics: for example, speaking rate variations will cause changes both in formant frequencies and in their transition tracks. A word spoken at normal speed is recognized more reliably than the same word spoken by the same speaker at a much faster or slower pace. It is thus imperative to design interfaces which take such speaking-rate variability into account. To better incorporate speaker variability into digital devices, we study the effect of a) feature selection and b) the choice of network architecture on variable speaking rates. Four different features are evaluated on multiple configurations of Deep Neural Network (DNN) architectures. The findings show that log Filter-Bank Energies (FBE) outperformed the other acoustic features not only at the normal speaking rate but also at slow and fast speaking rates.
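As a rough sketch of the kind of setup compared in the paper, the example below extracts log filter-bank energies and feeds them to a small DNN frame classifier. The number of mel bands, layer widths, output classes, and file name are assumptions, not the paper's configuration, and no training loop is shown.

```python
# Sketch: a log filter-bank energy (FBE) front end feeding a small DNN frame
# classifier. Feature and network dimensions are illustrative assumptions.
import numpy as np
import librosa
import torch
import torch.nn as nn

def log_fbe(wav_path, n_mels=40, hop_s=0.01):
    y, sr = librosa.load(wav_path, sr=16000)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels,
                                         hop_length=int(hop_s * sr))
    return np.log(mel + 1e-10).T                   # (n_frames, n_mels)

class FrameClassifier(nn.Module):
    def __init__(self, in_dim, n_phones=40, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_phones),            # per-frame phone scores
        )

    def forward(self, x):
        return self.net(x)

if __name__ == "__main__":
    feats = torch.tensor(log_fbe("utterance.wav"), dtype=torch.float32)  # hypothetical file
    model = FrameClassifier(in_dim=feats.shape[1])
    print(model(feats).shape)                       # (n_frames, n_phones)
```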