Thomas Kemp - Academia.edu

Papers by Thomas Kemp

5. Regelbasiert generierte Aussprachevarianten für Spontansprache [Rule-based generation of pronunciation variants for spontaneous speech]

Natural Language Processing and Speech Technology, 1996

We investigate the occurrence of six classes of pronunciation variants in a very large corpus of spontaneous German speech. A high correlation between the dialect region of the speaker, the speaking rate and the word position within the utterance is observed. The integration of the pronunciation variants into a speech recognition system yields a moderate improvement of the word error rate.

Estimating confidence using word lattices

5th European Conference on Speech Communication and Technology (Eurospeech 1997)

For many practical applications of speech recognition systems, it is desirable to have an estimate of confidence for each hypothesized word, i.e. an estimate of which words of the speech recognizer's output are likely to be correct and which are not reliable. Many of today's speech recognition systems use word lattices as a compact representation of a set of alternative hypotheses. We exploit such word lattices as information sources for the confidence-measure tagger JANKA. In experiments on spontaneous human-to-human speech data, the use of word-lattice-related information significantly improves the tagging accuracy.

Unsupervised training of a speech recognizer: recent experiments

6th European Conference on Speech Communication and Technology (Eurospeech 1999)

Current speech recognition systems require large amounts of transcribed data for parameter estimation. Transcription, however, is tedious and expensive. In this work we describe experiments aimed at training a speech recognizer with only a minimal amount (30 minutes) of transcriptions and a large portion (50 hours) of untranscribed data. A recognizer is bootstrapped on the transcribed part of the data, and initial transcripts are generated with it for the remainder (the untranscribed part). Using a lattice-based confidence measure, the recognition errors are (partially) detected, and the remainder of the hypotheses is used for training. Using this scheme, the word error rate on a broadcast news speech recognition task dropped from more than 32.0% to 21.4%. In a cheating experiment we show that this performance cannot be significantly improved by improving the confidence measure. By combining the unsupervisedly trained system with our currently best recognizer, which is trained on 15.5 hours of transcribed data, an additional relative error reduction of 5% (compared to the system trained in a standard fashion) is possible.
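
The confidence-filtered self-training loop described above can be sketched in a few lines (the names `filter_hypotheses` and `build_training_set` are hypothetical; the actual system derives its confidences from word lattices):

```python
def filter_hypotheses(hyps, threshold=0.7):
    """Keep only words whose confidence is at or above the threshold.

    `hyps` is a list of (word, confidence) pairs, as a lattice-based
    confidence measure might attach to a recognized utterance.
    """
    return [word for word, conf in hyps if conf >= threshold]

def build_training_set(utterances, threshold=0.7):
    """Turn confidence-tagged hypotheses into filtered transcripts
    that can be fed back into acoustic-model training."""
    return [" ".join(filter_hypotheses(h, threshold)) for h in utterances]
```

For example, an utterance tagged `[("hello", 0.9), ("wrld", 0.3)]` would contribute only `"hello"` to the retraining material.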

Mixed Precision DNNs: All you need is a good parametrization

Efficient deep neural network (DNN) inference on mobile or embedded devices typically involves quantization of the network parameters and activations. In particular, mixed precision networks achieve better performance than networks with homogeneous bitwidth for the same size constraint. Since choosing the optimal bitwidths is not straightforward, training methods that can learn them are desirable. Differentiable quantization with straight-through gradients allows the quantizer's parameters to be learned with gradient methods. We show that a suitable parametrization of the quantizer is the key to stable training and good final performance. Specifically, we propose to parametrize the quantizer with the step size and dynamic range; the bitwidth can then be inferred from them. Other parametrizations, which explicitly use the bitwidth, consistently perform worse. We confirm our findings with experiments on CIFAR-10 and ImageNet and we obtain mixed precision DNNs with learne...
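
As an illustration of how a bitwidth can be inferred from step size and dynamic range, consider a symmetric uniform quantizer (a simplified sketch; the paper's exact parametrization may differ):

```python
import math

def inferred_bitwidth(step_size, dyn_range):
    """Bitwidth implied by a step size and a symmetric dynamic
    range [-dyn_range, dyn_range].

    The grid has roughly 2 * dyn_range / step_size + 1 levels; the
    bitwidth is the number of bits needed to index those levels.
    Illustrative only.
    """
    levels = 2 * dyn_range / step_size + 1
    return math.ceil(math.log2(levels))
```

With a step size of 1.0 and a dynamic range of 127.0 this yields the familiar 8-bit quantizer; shrinking the range or enlarging the step lowers the inferred bitwidth, which is what makes the bitwidth learnable through these two parameters.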

Strategies for automatic segmentation of audio data

2000 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.00CH37100)

In many applications, like the indexing of broadcast news or surveillance applications, the input data consists of a continuous, unsegmented audio stream. Speech recognition technology, however, usually requires segments of relatively short length as input. For such applications, effective methods to segment continuous audio streams into homogeneous segments are required. In this paper, three different segmentation strategies (model-based, metric-based and energy-based) are compared on the same broadcast news test data. It is shown that model-based and metric-based techniques outperform the simpler energy-based algorithms. While model-based segmenters achieve a very high level of segment boundary precision, the metric-based segmenter performs better in terms of segment boundary recall. To combine the advantages of both strategies, a new hybrid algorithm is introduced: the results of a preliminary metric-based segmentation are used to construct the models for the final model-based segmenter run. The new hybrid approach is shown to outperform the other segmentation strategies.
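
A minimal energy-based segmenter, the simplest of the three strategies compared above, might look like this (illustrative only; practical segmenters add smoothing and minimum-duration constraints):

```python
def energy_segments(frames, threshold):
    """Split a sequence of per-frame energies into contiguous runs of
    above-threshold frames. Returns (start, end) index pairs, with the
    end index exclusive.
    """
    segments, start = [], None
    for i, energy in enumerate(frames):
        if energy >= threshold and start is None:
            start = i                      # segment begins
        elif energy < threshold and start is not None:
            segments.append((start, i))    # segment ends at silence
            start = None
    if start is not None:                  # stream ended mid-segment
        segments.append((start, len(frames)))
    return segments
```

Energy alone cannot separate, say, music from speech, which is why the model-based and metric-based strategies in the paper perform better on broadcast material.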

Janus - towards multilingual spoken language translation

In our effort to build spoken language translation systems, we have extended our JANUS system to process spontaneous human-human dialogs in a new domain: two people trying to schedule a meeting. Trained on an initial database, JANUS-2 is able to translate English and German spoken input into English, German, Spanish, Japanese or Korean output. To tackle the difficulty of spontaneous human-human dialogs, we improved the JANUS-2 recognizer along its three knowledge sources: acoustic models, dictionary and language models. We developed a robust translation system which performs semantic rather than syntactic analysis and is thus particularly suited to processing spontaneous speech. We also describe repair methods to recover from recognition errors. (The roughly 18,000 utterances in the English scheduling domain correspond to some 30,000 sentences.)

Confidence measures for spontaneous speech recognition

1997 IEEE International Conference on Acoustics, Speech, and Signal Processing

For many practical applications of speech recognition systems, it is desirable to have an estimate of confidence for each hypothesized word, i.e. an estimate of which words of the speech recognizer's output are likely to be correct and which are not reliable. We describe the development of the confidence-measure tagger JANKA, which is able to provide confidence information for the words in the output of the speech recognizer JANUS-3-SR. On a spontaneous German human-to-human database, JANKA achieves a tagging accuracy of 90% at a baseline word accuracy of 82%.
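
Tagging accuracy, the figure of merit quoted above, is simply the fraction of words whose correct/incorrect status the tagger predicted right (a sketch with hypothetical names):

```python
def tagging_accuracy(predicted, actual):
    """Fraction of words for which the predicted correct/incorrect tag
    matches reality. `predicted` and `actual` are parallel lists of
    booleans, True meaning 'this word is correct'."""
    hits = sum(p == a for p, a in zip(predicted, actual))
    return hits / len(actual)
```

Note that at an 82% baseline word accuracy, a trivial tagger that marks every word as correct already scores 82%, so the quoted 90% reflects genuine discrimination.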

Differentiable Quantization of Deep Neural Networks

ArXiv, 2019

We propose differentiable quantization (DQ) for efficient deep neural network (DNN) inference, where gradient descent is used to learn the quantizer's step size, dynamic range and bitwidth. Training with differentiable quantizers brings two main benefits: first, DQ does not introduce hyperparameters; second, we can learn a different step size, dynamic range and bitwidth for each layer. Our experiments show that DNNs with heterogeneous and learned bitwidths yield better performance than DNNs with a homogeneous one. Further, we show that there is one natural DQ parametrization especially well suited for training. We confirm our findings with experiments on CIFAR-10 and ImageNet, and we obtain quantized DNNs with learned quantization parameters achieving state-of-the-art performance.
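
The forward pass of the kind of uniform quantizer being learned can be sketched as follows (in DQ the non-differentiable rounding is bypassed with a straight-through gradient estimate during backpropagation; `fake_quantize` is a hypothetical name):

```python
def fake_quantize(x, step, qmin, qmax):
    """Uniform 'fake' quantization as used in quantization-aware
    training: snap x to the step grid and clip to the integer range
    [qmin, qmax]. Only the forward pass is shown; the backward pass
    would treat round() as the identity (straight-through estimator).
    """
    q = round(x / step)          # nearest grid point
    q = max(qmin, min(qmax, q))  # clip to the representable range
    return q * step
```

Because step size, range and (implied) bitwidth all appear in this computation, all three can in principle be exposed to gradient descent, which is the core of the DQ idea.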

Speech Synthesis and Control Using Differentiable DSP

ArXiv, 2020

Modern text-to-speech systems are able to produce natural and high-quality speech, but speech contains factors of variation (e.g. pitch, rhythm, loudness, timbre) that text alone cannot specify. In this work we move towards a speech synthesis system that can produce diverse speech renditions of a text by allowing (but not requiring) explicit control over the various factors of variation. We propose a new neural vocoder that offers control of such factors of variation. This is achieved by employing differentiable digital signal processing (DDSP), previously used only for music rather than speech, which exposes these factors of variation. The results show that the proposed approach can produce natural speech with realistic timbre, and individual factors of variation can be freely controlled.
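
A core DDSP building block is an additive harmonic synthesizer, where the fundamental frequency controls pitch and the per-harmonic amplitudes shape the timbre (an illustrative sketch; the paper's vocoder is a learned neural model built around such differentiable components):

```python
import math

def harmonic_synth(f0, amps, sample_rate=16000, n_samples=160):
    """Sum sinusoids at integer multiples of the fundamental f0, each
    scaled by its amplitude in `amps`. Returns a list of samples."""
    out = []
    for n in range(n_samples):
        t = n / sample_rate
        sample = sum(a * math.sin(2 * math.pi * (k + 1) * f0 * t)
                     for k, a in enumerate(amps))
        out.append(sample)
    return out
```

Because every operation here is differentiable in `f0` and `amps`, a neural network predicting those controls can be trained end-to-end, which is what makes the factors of variation both learnable and freely controllable.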

Iteratively Training Look-Up Tables for Network Quantization

IEEE Journal of Selected Topics in Signal Processing, 2020

Operating deep neural networks (DNNs) on devices with limited resources requires the reduction of their memory as well as their computational footprint. Popular reduction methods are network quantization and pruning, which either reduce the word length of the network parameters or remove weights from the network if they are not needed. In this article we discuss a general framework for network reduction which we call Look-Up Table Quantization (LUT-Q). For each layer, we learn a value dictionary and an assignment matrix to represent the network weights. We propose a special solver which combines gradient descent and a one-step k-means update to learn both the value dictionaries and assignment matrices iteratively. This method is very flexible: by constraining the value dictionary, many different reduction problems, such as non-uniform network quantization, training of multiplierless networks, network pruning, or simultaneous quantization and pruning, can be implemented without changing the solver. This flexibility of the LUT-Q method allows us to use the same method to train networks for different hardware capabilities.
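
One iteration of the k-means half of such a solver might look like this (a hypothetical sketch; the full method alternates such updates with gradient-descent steps on the network weights):

```python
def lutq_step(weights, dictionary):
    """One k-means-style LUT-Q update: assign each weight to its
    nearest dictionary value, then move each dictionary value to the
    mean of the weights assigned to it. Returns (assignments, new
    dictionary); empty clusters keep their old value.
    """
    assign = [min(range(len(dictionary)),
                  key=lambda j: abs(w - dictionary[j]))
              for w in weights]
    new_dict = []
    for j, value in enumerate(dictionary):
        members = [w for w, a in zip(weights, assign) if a == j]
        new_dict.append(sum(members) / len(members) if members else value)
    return assign, new_dict
```

Constraining `dictionary` recovers the special cases named above: forcing a zero entry implements pruning, and restricting entries to powers of two yields multiplierless networks.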

JANUS-II: translation of spontaneous conversational speech

1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings

The Karlsruhe-Verbmobil speech recognition engine

1997 IEEE International Conference on Acoustics, Speech, and Signal Processing

Verbmobil, a German research project, aims at machine translation of spontaneous speech input. The ultimate goal is the development of a portable machine translator that will allow people to negotiate in their native language. Within this project, the University of Karlsruhe has developed a speech recognition engine that has been evaluated on a yearly basis during the project and shows very promising word accuracy results on large-vocabulary spontaneous speech. In this paper we introduce the Janus Speech Recognition Toolkit underlying the speech recognizer. The main new contributions to the acoustic modeling part of our 1996 evaluation system (speaker normalization, channel normalization and polyphonic clustering) are discussed and evaluated. Besides the acoustic models, we delineate the different language models used in our evaluation system: word trigram models interpolated with class-based models, and a separate spelling language model. As a result of using the toolkit and integrating all these parts into the recognition engine, the word error rate on the German Spontaneous Scheduling Task (GSST) could be decreased from 30% in 1995 to 13.8% in 1996.

Integrating different learning approaches into a multilingual spoken language translation system

Lecture Notes in Computer Science, 1996

The speech recognition engine, developed at the University of Karlsruhe, is part of the VERBMOBIL project and of the VERBMOBIL systems developed under BMBF funding. The Spanish speech translation module has been developed at Carnegie Mellon University under project ENTHUSIAST, funded by the US Government. Other components are under development in collaboration with partners of the C-STAR Consortium.

JANUS 93: towards spontaneous speech translation

Proceedings of ICASSP '94. IEEE International Conference on Acoustics, Speech and Signal Processing

We present first results from our efforts toward the translation of spontaneously spoken speech. Improvements include increased coverage, robustness, generality and speed of JANUS, the speech-to-speech translation system of Carnegie Mellon and Karlsruhe University. The recognition and translation engines have been upgraded to meet the requirements introduced by spontaneous human dialogs. To allow for development and evaluation on adequate data, a large database of dialogs is being gathered for English.

Modelling unknown words in spontaneous speech

1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings

In this paper we describe our experiments with different acoustic and language models for unknown words in spontaneous speech. We propose a syllable-based approach for the acoustic modelling of new words. Several models of different degrees of complexity are evaluated against each other. We show that the modelling of new words can decrease the error rate in the recognition of spontaneous human-to-human speech. In addition, the new-word models can be used as a measure of confidence capable of detecting errors in the recognition of spontaneous speech. Although the best performance is reached by applying phonetic a-priori knowledge in the design of the acoustic models, a purely data-driven approach is proposed which performs only slightly less efficiently.

Confidence measure based Language Identification

In this paper we present a new application for confidence measures in spoken language processing. In today's computerized dialogue systems, language identification (LID) is typically achieved via dedicated modules. In our approach, LID is integrated into the speech recognizer, thereby profiting from high-level linguistic knowledge at very little extra cost. Our approach is based on a word-lattice-based confidence measure [3] that was originally devised for unsupervised training. In this work, we show that the confidence-based language identification algorithm outperforms conventional score-based methods. This method is also less dependent on the acoustic characteristics of the transmission channel than score-based methods. By introducing additional parameters, unknown languages can be rejected. The proposed method is compared to a score-based approach on the Verbmobil database, a three-language task.
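
The core idea (run one recognizer per language, pick the language whose output has the highest mean word confidence, and reject when even that is low) can be sketched as follows; the function name and threshold are hypothetical:

```python
def identify_language(confidences_by_lang, reject_threshold=None):
    """Pick the language whose recognizer produced the highest mean
    word confidence. With `reject_threshold` set, return None when
    even the best mean confidence stays below it, approximating the
    unknown-language rejection described above.

    `confidences_by_lang` maps a language code to the list of word
    confidences produced by that language's recognizer.
    """
    means = {lang: sum(c) / len(c)
             for lang, c in confidences_by_lang.items()}
    best = max(means, key=means.get)
    if reject_threshold is not None and means[best] < reject_threshold:
        return None
    return best
```

Unlike raw acoustic scores, the confidences are already normalized per word, which is one intuition for why this approach is less sensitive to channel characteristics.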

Janus II - Advances in Spontaneous Speech Translation

JANUS II is a research system to design and test components of speech-to-speech translation systems, as well as a research prototype for such a system. We focus on two aspects of the system: 1) new features and recognition performance of the speech recognition component JANUS-SR, and 2) the end-to-end performance of JANUS II, including a comparison of two machine translation strategies used for JANUS-MT (PHOENIX and GLR*). Currently JANUS II components for English, German, ...

Unsupervised training of a speech recognizer using TV broadcasts

5th International Conference on Spoken Language Processing (ICSLP 1998)

Current speech recognition systems require large amounts of transcribed data for parameter estimation. Transcription, however, is tedious and expensive. In this work we describe experiments aimed at training a speech recognizer without transcriptions. The experiments were carried out with TV newscasts, which were recorded using a satellite receiver and simple MPEG coding hardware. The newscasts were automatically segmented into segments of similar acoustic background condition. This material is inexpensive and can be made available in large quantities, but no transcriptions are available for it. We develop a training scheme where a recognizer is bootstrapped using very little transcribed data and is then improved using new, untranscribed speech. We show that it is necessary to use a confidence measure to judge the initial transcriptions of the recognizer before using them. Higher improvements can be achieved if the number of parameters in the system is increased as more data becomes available. We show that the beneficial effect of unsupervised training is not compensated by MLLR adaptation on the hypothesis. In a final experiment, the effect of untranscribed data is compared with the effect of transcribed speech. Using the described methods, we found that the untranscribed data gives roughly one third of the improvement of the transcribed material.

Research paper thumbnail of 5. Regelbasiert generierte Aussprachevarianten für Spontansprache

Natural Language Processing and Speech Technology, 1996

We investigate the occurrence of six classes of pronounciation variants i n a v ery large corpus ... more We investigate the occurrence of six classes of pronounciation variants i n a v ery large corpus of spontaneous german speech. A high correlation between the dialect region of the speaker, the speaking rate and the word position within the utterance is observed. The integration of the pronounciation variants into a speech recognition system yields moderate improvement of the word error rate.

Research paper thumbnail of Estimating confidence using word lattices

5th European Conference on Speech Communication and Technology (Eurospeech 1997)

For many practical applications of speech recognition systems, it is desirable to have an estimat... more For many practical applications of speech recognition systems, it is desirable to have an estimate of con dence for each hypothesized word, i.e. to have an estimate which words of the speech recognizer's output are likely to be correct and which are not reliable. Many o f t o d a y's speech recognition systems use word lattices as a compact representation of a set of alternative hypothesis. We exploit the use of such word lattices as information sources for the measure-of-con dence tagger JANKA 1. In experiments on spontaneous human-to-human speech data the use of word lattice related information signi cantly improves the tagging accuracy.

Research paper thumbnail of Unsupervised training of a speech recognizer: recent experiments

6th European Conference on Speech Communication and Technology (Eurospeech 1999)

Current speech recognition systems require large amounts of transcribed data for parameter estima... more Current speech recognition systems require large amounts of transcribed data for parameter estimation. The transcription, however, is tedious and expensive. In this work we describe our experiments which are aimed at training a speech recognizer with only a minimal amount (30 minutes) of transcriptions and a large portion (50 hours) of untranscribed data. A recognizer is bootstrapped on the transcribed part of the data and initial transcripts are generated with it for the remainder (the untranscribed part). Using a lattice-based con dence measure, the recognition errors are (partially) detected and the remainder of the hypotheses is used for training. Using this scheme, the word error rate on a broadcast news speech recognition task dropped from more than 32.0% to 21.4%. In a cheating experiment we show, that this performance cannot be signi cantly improved by improving the measure of con dence. By combining the unsupervisedly trained system with our currently best recognizer which is trained on 15.5 hours of transcribed data, an additional error reduction of 5% relative (as compared to the system trained in a standard fashion) is possible.

Research paper thumbnail of Mixed Precision DNNs: All you need is a good parametrization

Efficient deep neural network (DNN) inference on mobile or embedded devices typically involves qu... more Efficient deep neural network (DNN) inference on mobile or embedded devices typically involves quantization of the network parameters and activations. In particular, mixed precision networks achieve better performance than networks with homogeneous bitwidth for the same size constraint. Since choosing the optimal bitwidths is not straight forward, training methods, which can learn them, are desirable. Differentiable quantization with straight-through gradients allows to learn the quantizer's parameters using gradient methods. We show that a suited parametrization of the quantizer is the key to achieve a stable training and a good final performance. Specifically, we propose to parametrize the quantizer with the step size and dynamic range. The bitwidth can then be inferred from them. Other parametrizations, which explicitly use the bitwidth, consistently perform worse. We confirm our findings with experiments on CIFAR-10 and ImageNet and we obtain mixed precision DNNs with learne...

Research paper thumbnail of Regelbasiert generierte Aussprachevarianten für Spontansprache

We investigate the occurrence of six classes of pronounciation variants i n a v ery large corpus ... more We investigate the occurrence of six classes of pronounciation variants i n a v ery large corpus of spontaneous german speech. A high correlation between the dialect region of the speaker, the speaking rate and the word position within the utterance is observed. The integration of the pronounciation variants into a speech recognition system yields moderate improvement of the word error rate.

Research paper thumbnail of Strategies for automatic segmentation of audio data

2000 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.00CH37100)

In many applications, like indexing of broadcast news or surveillance applications, the input dat... more In many applications, like indexing of broadcast news or surveillance applications, the input data consists of a continuous, unsegmented audio stream. Speech recognition technology, however, usually requires segments of relatively short length as input. For such applications, effective methods to segment continuous audio streams into homogeneous segments are required. In this paper, three different segmenting strategies (model-based, metric-based and energy-based) are compared on the same broadcast news test data. It is shown that model-based and metric-based techniques outperform the simpler energy-based algorithms. While model-based segmenters achieve very high level of segment boundary precision, the metric-based segmenter performes better in terms of segment boundary recall (RCL). To combine the advantages of both strategies, a new hybrid algorithm is introduced. For this, the results of a preliminary metric-based segmentation are used to construct the models for the final model-based segmenter run. The new hybrid approach is shown to outperform the other segmenting strategies.

Research paper thumbnail of Janus - towards multilingual spoken language translation

In our effort to build spoken language translation systems we have extended our JANUS system to p... more In our effort to build spoken language translation systems we have extended our JANUS system to process spontaneous human-human dialogs in a new domain, two people trying to schedule a meeting. Trained on an initial database JANUS-2 is able to translate English and German spoken input in either English, German, Spanish, Japanese or Korean output. To tackle the difficulty of spontaneous human-human dialogs we improved the JANUS-2 recognizer along its three knowledge sources acoustic models, dictionary and language models. We developed a robust translation system which performs semantic rather than syntactic analysis and thus is particulary suited to processing spontaneous speech. We describe repair methods to recover from recognition errors. 1 The about 18000 utterances in English Scheduling correspond to some 30000 sentences.

Research paper thumbnail of Mixed Precision DNNs: All you need is a good parametrization

Efficient deep neural network (DNN) inference on mobile or embedded devices typically involves qu... more Efficient deep neural network (DNN) inference on mobile or embedded devices typically involves quantization of the network parameters and activations. In particular, mixed precision networks achieve better performance than networks with homogeneous bitwidth for the same size constraint. Since choosing the optimal bitwidths is not straight forward, training methods, which can learn them, are desirable. Differentiable quantization with straight-through gradients allows to learn the quantizer's parameters using gradient methods. We show that a suited parametrization of the quantizer is the key to achieve a stable training and a good final performance. Specifically, we propose to parametrize the quantizer with the step size and dynamic range. The bitwidth can then be inferred from them. Other parametrizations, which explicitly use the bitwidth, consistently perform worse. We confirm our findings with experiments on CIFAR-10 and ImageNet and we obtain mixed precision DNNs with learne...

Research paper thumbnail of Confidence measures for spontaneous speech recognition

1997 IEEE International Conference on Acoustics, Speech, and Signal Processing

For many practical applications of speech recognition systems, it is desirable to have an estimat... more For many practical applications of speech recognition systems, it is desirable to have an estimate of condence for each h ypothesized word, i.e. to have an estimate of which words of the output of the speech recognizer are likely to be correct and which are not reliable. We describe the development of the measure of condence tagger JANKA, which i s able to provide condence information for the words in the output of the speech recognizer JANUS-3-SR. On a spontaneous german human-to-human database, JANKA achieves a tagging accuracy of 90% at a baseline word accuracy of 82%.

Research paper thumbnail of Differentiable Quantization of Deep Neural Networks

ArXiv, 2019

We propose differentiable quantization (DQ) for efficient deep neural network (DNN) inference whe... more We propose differentiable quantization (DQ) for efficient deep neural network (DNN) inference where gradient descent is used to learn the quantizer's step size, dynamic range and bitwidth. Training with differentiable quantizers brings two main benefits: first, DQ does not introduce hyperparameters; second, we can learn for each layer a different step size, dynamic range and bitwidth. Our experiments show that DNNs with heterogeneous and learned bitwidth yield better performance than DNNs with a homogeneous one. Further, we show that there is one natural DQ parametrization especially well suited for training. We confirm our findings with experiments on CIFAR-10 and ImageNet and we obtain quantized DNNs with learned quantization parameters achieving state-of-the-art performance.

Research paper thumbnail of Speech Synthesis and Control Using Differentiable DSP

ArXiv, 2020

Modern text-to-speech systems are able to produce natural and high-quality speech, but speech con... more Modern text-to-speech systems are able to produce natural and high-quality speech, but speech contains factors of variation (e.g. pitch, rhythm, loudness, timbre)\ that text alone cannot contain. In this work we move towards a speech synthesis system that can produce diverse speech renditions of a text by allowing (but not requiring) explicit control over the various factors of variation. We propose a new neural vocoder that offers control of such factors of variation. This is achieved by employing differentiable digital signal processing (DDSP) (previously used only for music rather than speech), which exposes these factors of variation. The results show that the proposed approach can produce natural speech with realistic timbre, and individual factors of variation can be freely controlled.

Research paper thumbnail of Iteratively Training Look-Up Tables for Network Quantization

IEEE Journal of Selected Topics in Signal Processing, 2020

Operating deep neural networks (DNNs) on devices with limited resources requires the reduction of... more Operating deep neural networks (DNNs) on devices with limited resources requires the reduction of their memory as well as computational footprint. Popular reduction methods are network quantization or pruning, which either reduce the word length of the network parameters or remove weights from the network if they are not needed. In this article we discuss a general framework for network reduction which we call Look-Up Table Quantization (LUT-Q). For each layer, we learn a value dictionary and an assignment matrix to represent the network weights. We propose a special solver which combines gradient descent and a one-step k-means update to learn both the value dictionaries and assignment matrices iteratively. This method is very flexible: by constraining the value dictionary, many different reduction problems such as non-uniform network quantization, training of multiplierless networks, network pruning or simultaneous quantization and pruning can be implemented without changing the solver. This flexibility of the LUT-Q method allows us to use the same method to train networks for different hardware capabilities.

Research paper thumbnail of JANUS-II-translation of spontaneous conversational speech

1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings

Research paper thumbnail of The Karlsruhe-Verbmobil speech recognition engine

1997 IEEE International Conference on Acoustics, Speech, and Signal Processing

Verbmobil, a German research project, aims at machine translation of spontaneous speech input. The ultimate goal is the development of a portable machine translator that will allow people to negotiate in their native language. Within this project the University of Karlsruhe has developed a speech recognition engine that has been evaluated on a yearly basis during the project and shows very promising speech recognition word accuracy results on large vocabulary spontaneous speech. In this paper we will introduce the Janus Speech Recognition Toolkit underlying the speech recognizer. The main new contributions to the acoustic modeling part of our 1996 evaluation system, speaker normalization, channel normalization and polyphonic clustering, will be discussed and evaluated. Besides the acoustic models we delineate the different language models used in our evaluation system: word trigram models interpolated with class-based models and a separate spelling language model were applied. As a result of using the toolkit and integrating all these parts into the recognition engine, the word error rate on the German Spontaneous Scheduling Task (GSST) could be decreased from 30% in 1995 to 13.8% in 1996.
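The language-model combination mentioned above, a word trigram linearly interpolated with a class-based model, can be sketched in a few lines. The probability tables and the weight below are toy values of my own; a real system estimates them from counts with smoothing.

```python
def interp_prob(word, history, trigram, class_ngram, emit, word2class, lam=0.7):
    """P(w|h) = lam * P_tri(w|h) + (1-lam) * P_cls(c(w)|h) * P(w|c(w)).

    The class term lets mass flow to words never seen in this trigram
    context, as long as their class was seen there."""
    p_tri = trigram.get((history, word), 1e-6)
    c = word2class[word]
    p_cls = class_ngram.get((history, c), 1e-6) * emit.get((c, word), 1e-6)
    return lam * p_tri + (1.0 - lam) * p_cls

# Toy tables for a scheduling-style context "on next ...".
trigram = {(("on", "next"), "monday"): 0.20}
class_ngram = {(("on", "next"), "WEEKDAY"): 0.60}
emit = {("WEEKDAY", "monday"): 0.25, ("WEEKDAY", "friday"): 0.25}
word2class = {"monday": "WEEKDAY", "friday": "WEEKDAY"}

p_seen = interp_prob("monday", ("on", "next"), trigram, class_ngram, emit, word2class)
p_unseen = interp_prob("friday", ("on", "next"), trigram, class_ngram, emit, word2class)
```

Here "friday" was never seen after "on next", yet the class-based term still gives it substantial probability, which is exactly why the interpolation helps on sparse spontaneous-speech data.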

Research paper thumbnail of Integrating different learning approaches into a multilingual spoken language translation system

Lecture Notes in Computer Science, 1996

recognition engine, developed at the University of Karlsruhe, is part of the VERBMOBIL project and VERBMOBIL systems developed under BMBF funding. The Spanish speech translation module has been developed at Carnegie Mellon University under project ENTHUSIAST funded by the US Government. Other components are under development in collaboration with partners of the C-STAR Consortium.

Research paper thumbnail of JANUS 93: towards spontaneous speech translation

Proceedings of ICASSP '94. IEEE International Conference on Acoustics, Speech and Signal Processing

We present first results from our efforts toward translation of spontaneously spoken speech. Improvements include increasing coverage, robustness, generality and speed of JANUS, the speech-to-speech translation system of Carnegie Mellon and Karlsruhe University. The Recognition and Translation Engines have been upgraded to meet requirements introduced by spontaneous human dialogs. To allow for development and evaluation on adequate data, a large database with dialogs is being gathered.

Research paper thumbnail of Modelling unknown words in spontaneous speech

1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings

In this paper we describe our experiments with different acoustic and language models for unknown words in spontaneous speech. We propose a syllable based approach for the acoustic modelling of new words. Several models of different degrees of complexity are evaluated against each other. We show that the modelling of new words can decrease the error rate in the recognition of spontaneous human-to-human speech. In addition, the new word models can be used as a measure of confidence capable of detecting errors in the recognition of spontaneous speech. Although the best performance is reached by applying phonetic a-priori knowledge in the design of the acoustic models, a pure data-driven approach is proposed which performs only slightly less efficiently.

Research paper thumbnail of Confidence measure based Language Identification

In this paper we present a new application for confidence measures in spoken language processing. In today's computerized dialogue systems, language identification (LID) is typically achieved via dedicated modules. In our approach, LID is integrated into the speech recognizer, therefore profiting from high-level linguistic knowledge at very little extra cost. Our new approach is based on a word lattice based confidence measure [3], which was originally devised for unsupervised training. In this work, we show that the confidence based language identification algorithm outperforms conventional score based methods. Also, this method is less dependent on the acoustic characteristics of the transmission channel than score based methods. By introducing additional parameters, unknown languages can be rejected. The proposed method is compared to a score based approach on the Verbmobil database, a three language task.
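The decision rule implied by the abstract can be sketched as follows: run a recognizer per candidate language, average the word confidences from each lattice, pick the best language, and reject when even the best average falls below a threshold. The function name, the threshold value and the toy confidences are my own illustration, not the paper's exact rule.

```python
def identify_language(lattice_confidences, reject_threshold=0.5):
    """Pick the language whose recognizer yields the highest average
    word confidence; return None (unknown language) if even the best
    average falls below the rejection threshold."""
    avg = {lang: sum(c) / len(c) for lang, c in lattice_confidences.items()}
    best = max(avg, key=avg.get)
    return best if avg[best] >= reject_threshold else None

# Per-word confidences from each language's recognizer on one utterance.
confs = {
    "german":   [0.92, 0.85, 0.88, 0.90],   # in-language: high confidences
    "english":  [0.41, 0.55, 0.38, 0.50],
    "japanese": [0.30, 0.45, 0.42, 0.35],
}
lang = identify_language(confs)
```

Because the confidences are normalized quantities rather than raw acoustic scores, the comparison across recognizers is less sensitive to channel characteristics, which matches the robustness claim above.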

Research paper thumbnail of Janus II-Advances in Spontaneous Speech Translation

JANUS II is a research system to design and test components of speech to speech translation systems as well as a research prototype for such a system. We will focus on two aspects of the system: 1) new features and recognition performance of the speech recognition component JANUS-SR and 2) the end-to-end performance of JANUS II, including a comparison of two machine translation strategies used for JANUS-MT (PHOENIX and GLR*). 1. INTRODUCTION Currently JANUS II components for English, German, ...

Research paper thumbnail of Unsupervised training of a speech recognizer using TV broadcasts

5th International Conference on Spoken Language Processing (ICSLP 1998)

Current speech recognition systems require large amounts of transcribed data for parameter estimation. The transcription, however, is tedious and expensive. In this work we describe our experiments which are aimed at training a speech recognizer without transcriptions. The experiments were carried out with TV newscasts, that were recorded using a satellite receiver and a simple MPEG coding hardware. The newscasts were automatically segmented into segments of similar acoustic background condition. This material is inexpensive and can be made available in large quantities, but there are no transcriptions available. We develop a training scheme, where a recognizer is bootstrapped using very little transcribed data and is improved using new, untranscribed speech. We show that it is necessary to use a confidence measure to judge the initial transcriptions of the recognizer before using them. Higher improvements can be achieved if the number of parameters in the system is increased when more data becomes available. We show, that the beneficial effect of unsupervised training is not compensated by MLLR adaptation on the hypothesis. In a final experiment, the effect of untranscribed data is compared with the effect of transcribed speech. Using the described methods, we found that the untranscribed data gives roughly one third of the improvement of the transcribed material.
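The confidence-filtering step at the heart of this training scheme, keeping only hypothesized words the recognizer is sufficiently sure about before retraining on them, can be sketched with toy data. In the paper the per-word confidences come from word lattices; here they are made-up values, and the threshold is illustrative.

```python
def select_training_words(hypothesis, threshold=0.7):
    """Keep only hypothesized (word, confidence) pairs whose lattice
    confidence reaches the threshold; the surviving words become
    training transcripts for the next iteration of the recognizer."""
    return [(w, c) for (w, c) in hypothesis if c >= threshold]

# Automatic transcript of an untranscribed segment, with toy confidences.
hyp = [("the", 0.95), ("wether", 0.31), ("weather", 0.74),
       ("will", 0.88), ("stay", 0.90), ("fine", 0.66)]
kept = select_training_words(hyp)
```

The likely misrecognition ("wether") is dropped, so the next training pass sees mostly correct labels, which is why the abstract stresses that judging the initial transcriptions is necessary before using them.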