Speech translation enhanced automatic speech recognition
Related papers
2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07, 2007
In this paper we describe our work on coupling automatic speech recognition (ASR) and machine translation (MT) in a speech translation enhanced automatic speech recognition (STE-ASR) framework for transcribing and translating European Parliament speeches. We demonstrate the influence of the quality of the ASR component on MT performance by comparing a series of word error rates (WERs) with the corresponding automatic translation scores. By porting an STE-ASR framework to this task, we show how the word error rates for transcribing English and Spanish speeches can be lowered by 3.0% and 4.8% relative, respectively.
Document driven machine translation enhanced ASR
Interspeech 2005
In human-mediated translation scenarios a human interpreter translates between a source and a target language using either a spoken or a written representation of the source language. In this paper we improve the recognition performance on the speech of the human translator spoken in the target language by taking advantage of the source language representations. We use machine translation techniques to translate between the source and target language resources and then bias the target language speech recognizer towards the gained knowledge, hence the name Machine Translation Enhanced Automatic Speech Recognition. We investigate several techniques, including restricting the search vocabulary, selecting hypotheses from n-best lists, applying cache and interpolation schemes to language modeling, and combining the most successful techniques into our final, iterative system. Overall we outperform the baseline system by a relative word error rate reduction of 37.6%.
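One common way to realize the cache and interpolation schemes mentioned above is to build a unigram cache from the machine-translated source text and interpolate it with the base language model when scoring recognizer hypotheses. The sketch below is a minimal illustration of that idea; the function names, the unigram cache, and the fixed interpolation weight are assumptions, not the paper's implementation.

```python
from collections import Counter

def build_cache(mt_output_tokens):
    """Relative-frequency unigram cache over the MT output."""
    counts = Counter(mt_output_tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def interpolated_prob(word, history, base_lm, cache, lam=0.1):
    """P(w|h) = (1 - lam) * P_base(w|h) + lam * P_cache(w)."""
    p_base = base_lm(word, history)      # any backoff n-gram model
    p_cache = cache.get(word, 0.0)
    return (1.0 - lam) * p_base + lam * p_cache

# Usage: score a recognizer hypothesis word by word with the biased model.
cache = build_cache("the committee approved the budget".split())
base_lm = lambda w, h: 1e-4              # stand-in for a real language model
hyp = "the committee approved the budget".split()
score = 1.0
for i, w in enumerate(hyp):
    score *= interpolated_prob(w, tuple(hyp[:i]), base_lm, cache)
print(f"biased LM score: {score:.3e}")
```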
Integration of ASR and machine translation models in a document translation task
Interspeech 2007
This paper is concerned with the problem of machine-aided human language translation. It addresses a translation scenario where a human translator dictates the spoken language translation of a source language text into an automatic speech dictation system. The source language text in this scenario is also presented to a statistical machine translation (SMT) system. The techniques presented in the paper assume that the optimum target language word string produced by the dictation system is modeled using the combined SMT and ASR statistical models. These techniques were evaluated on a speech corpus in which human translators dictated English language translations of French language text obtained from transcriptions of the proceedings of the Canadian House of Commons. The paper shows that the combined ASR/SMT modeling techniques reduce ASR WER by 26.6% relative to that of an ASR system that does not incorporate SMT knowledge.
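The combined SMT/ASR modeling described above is commonly realized as a log-linear combination of acoustic, target language model, and translation model scores; the paper's exact parameterization may differ. With x the dictated speech, f the source text, and e the target word string:

```latex
\hat{e} \;=\; \operatorname*{argmax}_{e}\,
  \Bigl[ \log P_{\mathrm{AM}}(x \mid e)
       + \lambda_{\mathrm{LM}} \log P_{\mathrm{LM}}(e)
       + \lambda_{\mathrm{MT}} \log P_{\mathrm{MT}}(e \mid f) \Bigr]
```

where the weights lambda are tuned on held-out dictation data.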
Evaluating Automatic Speech Recognition in Translation
2018
We address and evaluate the challenges of utilizing Automatic Speech Recognition (ASR) to support the human translator. Audio transcription and translation are known to be far more time-consuming than text translation, taking at least two to three times longer. Furthermore, the time needed to translate or transcribe audio depends heavily on audio quality, which can be impaired by background noise, overlapping voices, and other acoustic conditions. The purpose of this paper is to explore the integration of ASR into the translation workflow and to evaluate the challenges of utilizing ASR to support the human translator. We present several case studies in different settings in order to evaluate the benefits of ASR, with time as the primary factor in this evaluation. We show that ASR can be used effectively to assist, but not replace, the human translator in essential ways.
The RWTH Aachen speech recognition and machine translation system for IWSLT 2012
In this paper, the automatic speech recognition (ASR) and statistical machine translation (SMT) systems of RWTH Aachen University developed for the evaluation campaign of the International Workshop on Spoken Language Translation (IWSLT) 2012 are presented. We participated in the ASR (English), MT (English-French, Arabic-English, Chinese-English, German-English) and SLT (English-French) tracks. For the MT track both hierarchical and phrase-based SMT decoders are applied. A number of different techniques are evaluated in the MT and SLT tracks, including domain adaptation via data selection, translation model interpolation, phrase training for hierarchical and phrase-based systems, additional reordering model, word class language model, various Arabic and Chinese segmentation methods, postprocessing of speech recognition output with an SMT system, and system combination. By application of these methods we can show considerable improvements over the respective baseline systems.
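The "domain adaptation via data selection" listed above is often implemented with the cross-entropy difference criterion of Moore and Lewis (2010); that RWTH used exactly this criterion is an assumption here. A minimal sketch, with unigram models standing in for the real n-gram language models:

```python
import math

def sentence_logprob(tokens, lm):
    """Sum of per-token log-probabilities under a unigram LM (stand-in)."""
    return sum(math.log(lm.get(w, 1e-8)) for w in tokens)

def select(candidates, in_domain_lm, general_lm, threshold=0.0):
    """Keep sentences scored as more in-domain than general-domain."""
    kept = []
    for sent in candidates:
        tokens = sent.split()
        score = (sentence_logprob(tokens, in_domain_lm)
                 - sentence_logprob(tokens, general_lm)) / len(tokens)
        if score > threshold:
            kept.append(sent)
    return kept

in_lm = {"parliament": 0.02, "the": 0.05}
gen_lm = {"parliament": 0.001, "the": 0.05}
print(select(["the parliament", "the game"], in_lm, gen_lm))  # -> ['the parliament']
```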
The IRST English-Spanish translation system for European Parliament speeches
2007
This paper presents the spoken language translation system developed at FBK-irst during the TC-STAR project. The system integrates automatic speech recognition with machine translation through the use of confusion networks, which make it possible to compactly represent a huge number of transcription hypotheses generated by the speech recognizer. Confusion networks are efficiently decoded by a statistical machine translation system which computes the most probable translation in the target language. This paper presents the whole architecture developed for the translation of political speeches held at the European Parliament, from English to Spanish and vice versa, and at the Spanish Parliament, from Spanish to English.
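A confusion network, as used above, is a linear sequence of slots, each holding alternative words with posterior probabilities (including an epsilon entry for deletions). The toy decoder below simply takes the best word per slot; in the actual system the SMT decoder searches over all paths. The example words and posteriors are illustrative.

```python
EPS = "<eps>"

# One slot per word position; each slot lists (word, posterior) alternatives.
confusion_network = [
    [("the", 0.7), ("a", 0.3)],
    [("parliament", 0.6), ("apartment", 0.4)],
    [("approved", 0.9), (EPS, 0.1)],
]

def best_path(cn):
    """Consensus decoding: pick the highest-posterior word in each slot."""
    words = []
    for slot in cn:
        word, _ = max(slot, key=lambda wp: wp[1])
        if word != EPS:
            words.append(word)
    return " ".join(words)

print(best_path(confusion_network))  # -> "the parliament approved"
```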
Adaptation of lecture speech recognition system with machine translation output
2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 2013
In spoken language translation, integration of the ASR and MT components is critical for good performance. In this paper, we consider the recognition setting where a text translation of each utterance is also available. We present experiments with different ASR system adaptation techniques that exploit MT system outputs. In particular, N-best MT outputs are represented as an utterance-specific language model, which is then used to rescore ASR lattices. We show that this method improves significantly over ASR alone, resulting in an absolute WER reduction of more than 6% for both in-domain and out-of-domain acoustic models.
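A sketch of the adaptation idea above: estimate an utterance-specific bigram model from the N-best MT outputs, then combine it with the original ASR score. Rescoring an n-best list stands in for full lattice rescoring here, and the smoothing and interpolation weights are illustrative assumptions.

```python
import math
from collections import Counter

def bigram_lm(mt_nbest):
    """Smoothed maximum-likelihood bigrams over the N-best MT hypotheses."""
    bi, uni = Counter(), Counter()
    for hyp in mt_nbest:
        toks = ["<s>"] + hyp.split() + ["</s>"]
        uni.update(toks[:-1])
        bi.update(zip(toks[:-1], toks[1:]))
    return lambda h, w: (bi[(h, w)] + 0.1) / (uni[h] + 0.1 * len(uni))

def rescore(asr_nbest, lm, alpha=0.5):
    """asr_nbest: list of (hypothesis, asr_log_score) pairs."""
    best, best_score = None, -math.inf
    for hyp, asr_score in asr_nbest:
        toks = ["<s>"] + hyp.split() + ["</s>"]
        lm_score = sum(math.log(lm(h, w)) for h, w in zip(toks[:-1], toks[1:]))
        total = asr_score + alpha * lm_score
        if total > best_score:
            best, best_score = hyp, total
    return best

lm = bigram_lm(["the budget was approved", "a budget was approved"])
asr_nbest = [("the budget was approved", -12.0), ("the budge was approved", -11.5)]
print(rescore(asr_nbest, lm))  # MT evidence promotes "the budget was approved"
```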
Investigating translation of Parliament speeches
IEEE Workshop on Automatic Speech Recognition and Understanding, 2005., 2005
This paper reports on recent experiments for speech to text (STT) translation of European Parliamentary speeches. A Spanish speech to English text translation system has been built using data from the TC-STAR European project. The speech recognizer is a state-of-the-art multipass system trained for the Spanish EPPS task and the statistical translation system relies on the IBM-4 model. First, MT results are compared using manual transcriptions and 1-best ASR hypotheses with different word error rates. Then, an n-best interface between the ASR and MT components is investigated to improve the STT process. Derivation of the fundamental equation for machine translation suggests that the source language model is not necessary for STT. This was investigated by using weak source language models and by n-best rescoring adding the acoustic model score only. A significant loss in the BLEU score was observed suggesting that the source language model is needed given the insufficiencies of the translation model. Adding the source language model score in the n-best rescoring process recovers the loss and slightly improves the BLEU score over the 1-best ASR hypothesis. The system achieves a BLEU score of 37.3 with an ASR word error rate of 10% and a BLEU score of 40.5 using the manual transcripts.
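The derivation mentioned above is the standard decomposition of the speech-to-text translation posterior (in the style of Ney's fundamental equation of speech translation). With x the source speech, f its transcription, and e the target word string, and assuming x depends on e only through f:

```latex
\hat{e} \;=\; \operatorname*{argmax}_{e}\; P(e \mid x)
       \;=\; \operatorname*{argmax}_{e}\; P(e)\, P(x \mid e)
       \;=\; \operatorname*{argmax}_{e}\; P(e) \sum_{f} P(x \mid f)\, P(f \mid e)
```

The source language model P(f) never enters this decomposition, which is what motivated the weak-source-LM experiments; the paper's observation is that it is nonetheless needed in practice to compensate for the insufficiencies of the translation model P(f | e).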
Spoken language translation using automatically transcribed text in training
In spoken language translation, a machine translation system takes speech as input and translates it into another language. A standard machine translation system is trained on written language data and expects written language as input. In this paper we propose an approach to close the gap between the output of automatic speech recognition and the input of machine translation by training the translation system on automatically transcribed speech. In our experiments we show improvements of up to 0.9 BLEU points on the IWSLT 2012 English-to-French speech translation task.
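A minimal sketch of the training-data construction described above, assuming a parallel corpus whose source side has corresponding audio: replace the written source sentences with automatic transcriptions, so the translation system is trained on the same kind of noisy input it will see at test time. asr_transcribe is a hypothetical stand-in for a real recognizer.

```python
def asr_transcribe(audio_path):
    # Hypothetical stand-in for a real recognizer decode of the audio file.
    return "this is a simulated asr transcript"

def build_training_corpus(utterances):
    """utterances: list of (audio_path, reference_translation) pairs."""
    pairs = []
    for audio_path, translation in utterances:
        hyp = asr_transcribe(audio_path)
        pairs.append((hyp, translation))  # noisy source, clean target
    return pairs
```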
KIT's Multilingual Speech Translation System for IWSLT 2023
arXiv, 2023
Many existing speech translation benchmarks focus on native-English speech in high-quality recording conditions, which often do not match the conditions of real-life use cases. In this paper, we describe our speech translation system for the multilingual track of IWSLT 2023, which evaluates translation quality on scientific conference talks. The test condition features accented input speech and terminology-dense content. The task requires translation into 10 languages with varying amounts of resources. In the absence of training data from the target domain, we use a retrieval-based approach (kNN-MT) for effective adaptation (+0.8 BLEU for speech translation). We also use adapters to easily integrate incremental training data from data augmentation, and show that this matches the performance of retraining. We observe that cascaded systems are more easily adaptable towards specific target domains, due to their separate modules. Our cascaded speech system substantially outperforms its end-to-end counterpart on scientific talk translation, although their performance remains similar on TED talks.
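A minimal sketch of the kNN-MT interpolation used for adaptation above: at each decoding step, retrieve the k nearest (decoder state, next token) pairs from an in-domain datastore and mix the induced distribution with the base model's. The tiny datastore and the fixed hyperparameters are illustrative assumptions.

```python
import numpy as np

def knn_distribution(query, keys, values, vocab_size, k=2, temperature=10.0):
    """Softmax over negative distances to the k nearest datastore entries."""
    dists = np.linalg.norm(keys - query, axis=1)
    nearest = np.argsort(dists)[:k]
    weights = np.exp(-dists[nearest] / temperature)
    weights /= weights.sum()
    p = np.zeros(vocab_size)
    for idx, w in zip(nearest, weights):
        p[values[idx]] += w
    return p

def interpolate(p_model, p_knn, lam=0.3):
    """p(y) = lam * p_kNN(y) + (1 - lam) * p_model(y)."""
    return lam * p_knn + (1.0 - lam) * p_model

keys = np.array([[0.0, 1.0], [0.1, 0.9], [1.0, 0.0]])  # stored decoder states
values = np.array([7, 7, 3])                           # next-token ids
p_model = np.full(10, 0.1)                             # toy base distribution
p_knn = knn_distribution(np.array([0.0, 1.0]), keys, values, vocab_size=10)
print(interpolate(p_model, p_knn).argmax())            # -> 7
```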