Jean-Pierre Martens | Ghent University

Papers by Jean-Pierre Martens

2004 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2004

Recently there has been an increasing amount of work in the area of automatic genre classification of music in audio format. Such systems can be used as a way to evaluate features describing musical content as well as a way to structure large collections of music. However, the evaluation and comparison of genre classification systems is hindered by the subjective perception of genre definitions by users. In this work we describe a set of experiments in automatic musical genre classification. An important contribution of this work is the comparison of the automatic results with human genre classification on the same dataset. The results show that, although there is significant room for improvement, genre classification is inherently subjective and therefore perfect results cannot be expected from either automatic algorithms or human annotation. The experiments also show that features derived from an auditory model perform similarly to features based on Mel-Frequency Cepstral Coefficients (MFCC).

International Conference on Acoustics, Speech, and Signal Processing, 1990

An auditory model incorporating critical band filtering, short time adaptation, and temporal analysis of the auditory nerve responses is presented. Unlike previously proposed synchrony models, this model emphasizes the instantaneous amplitude (i.e. the envelope) of the neural responses rather than the instantaneous frequency as the carrier of perceptually relevant information. It is demonstrated how the auditory speech representation can be ...

2003 IEEE Workshop on Automatic Speech Recognition and Understanding (IEEE Cat. No.03EX721), 2003

Nowadays, automatic speech recognizers have become quite good at recognizing well-prepared fluent speech (e.g. news readings). However, the recognition of unprepared or spontaneous speech is still problematic. Some important reasons for this are that spontaneous speech is less articulated, exhibits a high speaking rate and usually contains a lot of disfluencies. The latter occur when the speaker needs time to think about the continuation of his discourse, or when he needs to change or correct his last utterance. Although there are different types of disfluencies (interruptions, corrections, repetitions, etc.), the most common ones are filled pauses. They can take the form of an interjection like /uh/ or /uhm/, or an abnormal lengthening of one syllable of a word. In this paper we propose a new method for detecting such fillers prior to speech recognition. Tests show that it is possible to improve the recognition accuracy by just removing the detected filled pauses from the recognizer input.

2006 IEEE International Conference on Acoustics, Speech and Signal Processing Proceedings, 2006

It is acknowledged that in many medical and educational applications there is a great need for good objective assessments of the pronunciation proficiency of a speaker, either a non-native speaker of the language or a native speaker with a certain speech handicap (e.g. a deaf or dysarthric speaker). Most pronunciation scoring software developed thus far just measures an overall proficiency. The system proposed here envisages the computation of more detailed information on the nature of the pronunciation deficiencies. To that end, it works with a phonological representation of the speech. Described is the development of the system as well as its first encouraging assessments of non-native speakers of American English.

Most automatic speech recognition systems employ Hidden Markov Models with Gaussian mixture emission distributions to model the acoustics. There have been several attempts, however, to challenge this approach, e.g. by introducing a neural network (NN) as an alternative acoustic model. Although the performance of these so-called hybrid systems is actually quite good, their training is often problematic and time-consuming. By using a reservoir (a recurrent NN in which only the output weights are trainable) we can overcome this disadvantage and yet obtain good accuracy. In this paper, we propose the first reservoir-based connected digit recognition system, and we demonstrate good performance on the Aurora-2 testbed. Since reservoir computing (RC) is a new technology, we anticipate that our present system is still sub-optimal, and further improvements are possible.

Nowadays, automatic speech recognizers have become quite good at recognizing well-prepared fluent speech (e.g. news readings). However, the recognition of spontaneous speech is still problematic. Some reasons for this are that spontaneous speech is usually less articulated and that it can contain a lot of disfluencies such as filled pauses (FPs), abbreviations, repetitions, etc. In this paper, a new methodology for coping with FPs is presented. The basic idea is to detect FPs, and let this information control or modify the search for word hypotheses. Counting only normal words (excluding FPs), we can presently eliminate about one word error per FP occurring in the speech, without a significant increase in the computational load.

A segment-based Dynamic Programming (DP) / Multi-Layer Perceptron (MLP) hybrid for speaker-independent phone recognition is presented and evaluated. The system incorporates a segmentation stage, a broad phonetic classification MLP and a network of four context-independent phonetic classification MLPs for estimating a posteriori phone probabilities. The phonetic class information is then supplied to a DP phone recognition stage. A phone recognition accuracy of 58% (9.2% deletions, 6.6% insertions and 26.2% substitutions) is obtained with a system comprising less than 12000 free parameters and no phone language model.
1 Introduction
We have developed a segment-based recognition strategy using Dynamic Programming and Multi-Layer Perceptrons. Segmenting speech prior to its phonetic classification has several advantages: phonetic cues which are encoded at specific times in the speech signal, can be fully utilized [10], and spectral/temporal relationships over the duration of a phone...

In this contribution, the design, collection, annotation and planned distribution of a new spoken language resource of Afrikaans (SALAR) is discussed. The corpus contains speech of mother-tongue speakers of Afrikaans, and is intended to become a primary national language resource for phonetic research and research on pronunciation variations. As such, the corpus is designed to expose pronunciation variations due to regional accents, speech rate (normal and fast speech) and speech mode (read and spontaneous speech). The corpus is collected by the Potchefstroom Campus of the North-West University, but in all phases of the corpus creation process there was a close collaboration with ELIS-UG (Belgium), one of the institutions that has been engaged in the creation of the Spoken Dutch Corpus (CGN).

Proceeding of the 16th ACM international conference on Multimedia - MM '08, 2008

It is common practice to map the frequency content of music onto a chroma representation, but there exist many different schemes for constructing such a representation. In this paper, a new scheme is proposed. It comprises a detection of salient frequencies, a conversion of salient frequencies to notes, a psychophysically motivated weighting of harmonics in support of a note, a restriction of harmonic relations between different notes and a restriction of the deviations from a predefined pitch scale (e.g. the equal-tempered Western scale). A large-scale experimental evaluation has confirmed that the novel chroma representation more closely matches manual chord labels than the representations generated by six other tested schemes. Therefore, the new chroma representation is expected to improve applications such as song similarity matching and chord detection and labeling.
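To make the terms in this abstract concrete, here is a minimal Python sketch of the simplest possible chroma mapping. It is not the scheme proposed in the paper: it only converts frequencies to notes on the equal-tempered scale, rejects frequencies that deviate too far from that scale, and sums amplitudes per pitch class. The function names, the 440 Hz reference tuning and the 0.25-semitone tolerance are all assumptions chosen for illustration.

```python
import math

A4_HZ = 440.0  # assumed reference tuning (not taken from the paper)


def freq_to_pitch_class(freq_hz, max_dev_semitones=0.25):
    """Map a frequency to (pitch_class, deviation_in_semitones).

    Pitch classes are numbered 0..11 with C = 0. Returns None when the
    frequency lies further than max_dev_semitones from the nearest note
    of the equal-tempered scale (the tolerance is a hypothetical value).
    """
    midi = 69.0 + 12.0 * math.log2(freq_hz / A4_HZ)
    nearest = round(midi)
    deviation = midi - nearest
    if abs(deviation) > max_dev_semitones:
        return None
    return nearest % 12, deviation


def naive_chroma(peaks):
    """Sum peak amplitudes per pitch class, with no harmonic handling.

    peaks is a list of (frequency_hz, amplitude) pairs.
    """
    chroma = [0.0] * 12
    for freq_hz, amplitude in peaks:
        result = freq_to_pitch_class(freq_hz)
        if result is not None:
            chroma[result[0]] += amplitude
    return chroma
```

For a single note A3 with partials at 220, 440 and 660 Hz, `naive_chroma([(220.0, 1.0), (440.0, 0.5), (660.0, 0.25)])` places energy on pitch class 9 (A) but also on pitch class 4 (E), because the third harmonic falls on a different pitch class. This leakage is the kind of effect that a weighting of harmonics in support of a note is meant to address.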

Journal of New Music Research, 2014

In this paper, we present a probabilistic framework for the simultaneous estimation of chords and keys from audio. The framework is formulated in terms of acoustic models for both keys and chords, and a prior model that contains musicological knowledge about chords and keys. The latter consists of a compound of four components: a duration and a change model for both keys and chords. This division allows us to modify each of the components separately and to choose the most appropriate sources of knowledge for each of them. Furthermore, this makes it easier to interpret their role and their relevance in the estimation procedure. We compared multiple configurations of our system, increasing in complexity. This has permitted us to explore the relation between keys and chords, and the importance of integrating prior musicological knowledge into an automatic estimation system. It was found that chord estimation scores mostly depend on the integration of durational knowledge, while key estimation also requires prior information about the broader context.

Lecture Notes in Computer Science, 2003

It has been experimentally demonstrated that optimized multi-stream based speech recognisers can perform substantially better than the corresponding conventional systems on some particular recognition tasks. Typically, those applications present critical robustness problems, with the speech signal affected by noise that is localised in the acoustic space. In general, substantial recognition improvement is obtained by extracting multiple feature streams that encode highly complementary information related to the speech signal. The main goal of this experimental study is to assess the potential of the multi-stream statistical formalism on standard clean speech recognition tasks that are not particularly favourable to this approach, while intentionally using highly correlated feature streams. Nevertheless, it is demonstrated here that a careful design of the stream recombination model, adapting the local influence of the streams on the decoding process according to several information sources, can lead to significant performance gains compared to the corresponding single-stream systems.

2011 First International Conference on Informatics and Computational Intelligence, 2011

It has been shown for some time that a Recurrent Neural Network (RNN) can perform an accurate acoustic-phonetic decoding of a continuous speech stream. However, the error back-propagation through time (EBPTT) training of such a network is often critical (bad local optimum) and very time-consuming. These problems hamper the deployment of sufficiently large networks that would be able to outperform state-of-the-art Hidden Markov Models. To overcome this drawback of RNNs, we recently proposed to employ a large pool of recurrently connected non-linear nodes (a so-called reservoir) with fixed weights, and to map the reservoir outputs to meaningful phonemic classes by means of a layer of linear output nodes (called the readout nodes) whose weights form the solution of a set of linear equations. In this paper, we collect experimental evidence that the performance of a reservoir-based system can be enhanced by working with non-linear readout nodes. Although this calls for an iterative training, it boils down to a non-linear regression which seems to be less critical and time-consuming than EBPTT.
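The idea of a fixed reservoir with linear readout nodes can be illustrated with a small, generic sketch in pure Python. This is not the authors' system: the tanh node type, the weight scaling, the ridge term added to the linear equations, and all function names are assumptions made for this example. Only the readout weights are computed; the input and recurrent weights stay at their random initial values.

```python
import math
import random


def make_reservoir(n_nodes, n_inputs, scale=0.5, seed=0):
    """Draw fixed random input and recurrent weights (never trained)."""
    rng = random.Random(seed)
    w_in = [[rng.uniform(-1, 1) for _ in range(n_inputs)]
            for _ in range(n_nodes)]
    # Scaling by 1/sqrt(n_nodes) is an assumed heuristic to keep states bounded.
    w_rec = [[rng.uniform(-1, 1) * scale / math.sqrt(n_nodes)
              for _ in range(n_nodes)] for _ in range(n_nodes)]
    return w_in, w_rec


def run_reservoir(w_in, w_rec, inputs):
    """Return one state vector per input frame, with a constant bias appended."""
    state = [0.0] * len(w_in)
    states = []
    for u in inputs:
        state = [math.tanh(sum(a * b for a, b in zip(w_in[i], u)) +
                           sum(a * b for a, b in zip(w_rec[i], state)))
                 for i in range(len(w_in))]
        states.append(state + [1.0])
    return states


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def train_readout(states, targets, ridge=1e-3):
    """Linear readout: solve (S^T S + ridge * I) w = S^T y for w."""
    d = len(states[0])
    sts = [[sum(s[i] * s[j] for s in states) + (ridge if i == j else 0.0)
            for j in range(d)] for i in range(d)]
    sty = [sum(s[i] * t for s, t in zip(states, targets)) for i in range(d)]
    return solve(sts, sty)


def predict(states, weights):
    """Apply the linear readout to every state vector."""
    return [sum(a * b for a, b in zip(s, weights)) for s in states]
```

A toy usage would be to feed a sine wave through `run_reservoir`, call `train_readout` with the one-frame-delayed sine as target, and compare `predict` against that target. Because training reduces to one linear solve, no gradient has to be propagated back through time.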

2007 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU), 2007

The development of an automatic speech recognizer (ASR) that can accurately recognize spoken names belonging to a large lexicon is still a big challenge. One of the bottlenecks is that many names contain elements of a foreign language origin, and native speakers can adopt very different pronunciations of these elements, ranging from completely nativized to completely foreignized pronunciations. In this paper we further develop a recently proposed method for improving the recognition of foreign proper names spoken by native speakers. The main idea is to combine the standard acoustic model scores with scores emerging from a phonologically inspired back-off model that was trained on native speech only. This means that the proposed method does not require the development of any foreign phoneme models on foreign speech data. By applying our method to a baseline Dutch recognizer (comprising Dutch acoustic models) we could reduce the name error rate for French and English names by a considerable amount.

symposium.elis.ugent.be

... Johan Pauwels Supervisor(s): Jean-Pierre Martens ... Because a note usually contains harmonics of its fundamental frequency, one cannot simply add up the intensities of all frequencies belonging to the same pitch class to form a chroma profile. ...

In this paper a new pre-processor for a free speech transcription system is described. It performs a speech/non-speech partition, a segmentation of the speech parts into speaker turns, and a clustering of the speaker turns. It works in a stream-based mode, and it is aiming for a high accuracy with a low delay and processing time. Experiments on the Hub4 Broadcast News corpus show that the newly proposed pre-processor is competitive with and in some respects better than the best systems published so far. The paper also describes attempts to raise the system performance by supplementing the standard MFCC features with prosodic features such as pitch and voicing evidence.

One of the bottlenecks in the development of text-to-speech synthesizers based on segment concatenation is the need for large, segmented and labeled corpora. Consequently, as manual segmentation and labeling is a tedious and time-consuming task, there is a strong demand for automatic labeling systems which can label speech from many languages. Several systems have been proposed already, but they usually require hand-labeled training utterances before they can be used for a new language.

Neural Processing Letters, 1996

In this contribution, a new stochastically motivated random weight initialization scheme for pattern classifying Multi-Layer Perceptrons (MLPs) is presented. Its first aim is to ensure that all training examples and all nodes have an equal opportunity to contribute to the improvement of the network during the Error Back-Propagation (EBP) training. In addition, it pursues input scale invariance: if the network inputs were substituted by rescaled inputs, the initialization procedure should provide an equally well performing network. Finally, the new algorithm can initialize MLPs comprising both concentric (e.g., Gaussian) and squashing (e.g., sigmoidal) nodes. Experiments demonstrate that networks initialized using the proposed method train better than networks initialized using a standard random initialization scheme.

Neural Networks, 1991

... in which the global number of clusters, as well as the number of clusters per class are determined automatically. ... 6.3.2. Traditional Back Propagation. ... to the learning parameters, al- ... N. Weymaere and J.-P. Martens ... TABLE 2 Results for the Broad Phonetic Classification (1 frame ...

This paper focuses on the specification of the orthographic transcription task in the Spoken Dutch Corpus, the problems encountered in making that specification and the evaluation experiments that were carried out to assess the transcription efficiency and the inter-transcriber consistency. It is stated that the role of the orthographic transcriptions in the Spoken Dutch Corpus is twofold: on the one hand, the transcriptions are important for future database users, on the other hand they are indispensable to the development of the corpus itself. The main objectives of the transcription task are the following: (1) obtain a verbatim transcription that can be made with a minimum level of interpretation of the utterances; (2) obtain an alignment of the transcription to the speech signal on the level of relatively short chunks; (3) obtain a transcription that is useful to researchers working in several research areas and (4) adhere to international standards for existing large speech co...
