The CMU statistical machine translation system
Related papers
Bilingual phrases for statistical machine translation
2005
The statistical framework has proved to be very successful in machine translation. The main reason for this success is the existence of powerful techniques that make it possible to build machine translation systems automatically from available parallel corpora. Most statistical machine translation approaches are based on single-word translation models, which do not take bilingual contextual information into account. The translation model in the phrase-based approach defines correspondences between sequences of contiguous source words (source segments) and sequences of contiguous target words (target segments) instead of only correspondences between single source words and single target words. That is, statistical phrase-based translation models make use of explicit bilingual contextual information. Different methods for the selection of adequate bilingual word sequences and for training the parameters of these models are reviewed in this paper. Improved techniques for the selection of these sequences and the training of model parameters are also introduced. The phrase-based approach has been assessed in different tasks using different corpora, and the results obtained are comparable to or better than those obtained using other statistical and non-statistical machine translation systems.
Statistical Machine Translation
2016
Statistical Machine Translation (SMT) systems are based on bilingual sentence-aligned data. The quality of translation depends on the data provided for translation learning. A huge parallel corpus is required for performing statistical machine translation. The aim of this paper is to explore SMT using the Moses toolkit for creating a German-English translator. To perform the German-to-English translation, a parallel corpus of this language pair has been provided. The larger the data provided for training the Moses decoder, the more accurate the translated output.
The Johns Hopkins University 2003 Chinese-English Machine Translation System
2003
We describe a Chinese-to-English Machine Translation system developed at the Johns Hopkins University for the NIST 2003 MT evaluation. The system is based on a Weighted Finite State Transducer implementation of the alignment template translation model for statistical machine translation. The baseline MT system was trained using 100,000 sentence pairs selected from a static bitext training collection. Information retrieval techniques were then used to create specific training collections for each document to be translated. This document-specific training set included bitext and named entities that were then added to the baseline system by augmenting the library of alignment templates. We report translation performance of baseline and IR-based systems on two NIST MT evaluation test sets.
Joint Phrase Alignment and Extraction for Statistical Machine Translation
Journal of Information Processing, 2012
The phrase table, a scored list of bilingual phrases, lies at the center of phrase-based machine translation systems. We present a method to directly learn this phrase table from a parallel corpus of sentences that are not aligned at the word level. The key contribution of this work is that while previous methods have generally only modeled phrases at one level of granularity, in the proposed method phrases of many granularities are included directly in the model. This allows for the direct learning of a phrase table that achieves competitive accuracy without the complicated multistep process of word alignment and phrase extraction that is used in previous research. The model is achieved through the use of non-parametric Bayesian methods and inversion transduction grammars (ITGs), a variety of synchronous context-free grammars (SCFGs). Experiments on several language pairs demonstrate that the proposed model matches the accuracy of the more traditional two-step word alignment/phrase extraction approach while reducing its phrase table to a fraction of its original size.
Experiments with a Noun-Phrase driven Statistical Machine Translation System
This paper presents a noun-phrase-driven two-level statistical machine translation system. Noun phrases (NPs) are used as the unit of decomposition to build a two-level hierarchy of phrases. English noun phrases are identified using a parser. The corresponding translations are induced using a statistical word alignment model. Identified noun phrase pairs in the training corpus are replaced with a tag to produce an NP-tagged corpus. This corpus is then used to extract phrase translation pairs. Both NP translations and NP-tagged phrases are used in a two-level translation decoder: NP translations tag NPs in the first level, where NP-tagged phrases match across NPs to produce translations in the second level. The two-level system shows significant improvements over a baseline SMT system. It also produces longer matching phrases due to the generalization introduced by tagging NPs.
2005
Nowadays, most statistical translation systems are based on phrases (i.e. groups of words). In this paper we study different improvements to the standard phrase-based translation system. We describe a modified method for phrase extraction which deals with larger phrases while keeping a reasonable number of phrases. We also propose additional features which lead to a clear improvement in translation performance. We present results on the EuroParl task in the Spanish-to-English direction and results from the evaluation of the shared task "Exploiting Parallel Texts for Statistical Machine Translation" (ACL Workshop on Parallel Texts 2005).
An Overview of Statistical Machine Translation Tools
International Journal of Advanced Research in Computer Science and Software Engineering
Machine translation is a combination of many complex sub-processes, and the quality of the results of each sub-process, executed in a well-defined sequence, determines the overall accuracy of the translation. The Statistical Machine Translation (SMT) approach considers each sentence in the target language as a possible translation of any source-language sentence. The likelihood of each candidate is expressed as a probability, and the sentence with the highest probability is treated as the best translation. SMT is the most favoured approach not only because of its good results for corpus-rich language pairs, but also because of the tools it has been enhanced with over the past two and a half decades. The paper gives a brief introduction to SMT: its steps and the different tools available for each step.
2010
This paper describes the techniques we explored to improve the translation of news text in the German-English and Hungarian-English tracks of the WMT09 shared translation task. Beginning with a conventional hierarchical phrase-based system, we found benefits for using word segmentation lattices as input, explicit generation of beginning and end of sentence markers, minimum Bayes risk decoding, and incorporation of a feature scoring the alignment of function words in the hypothesized translation. We also explored the use of monolingual paraphrases to improve coverage, as well as co-training to improve the quality of the segmentation lattices used, but these did not lead to improvements.
Statistical Machine Translation with Terminology
This paper considers a scenario which is slightly different from Statistical Machine Translation (SMT) in that we are given almost perfect knowledge about bilingual terminology, considering the situation when a Japanese patent is filed with or granted by the Japanese Patent Office (JPO). Technically, we incorporate bilingual terminology into Phrase-based SMT (PB-SMT), focusing on its statistical properties. The first modification is made to the word aligner, which incorporates knowledge about terminology as prior knowledge. The second modification is made to both the language modeling and the translation modeling, which reflect the hierarchical structure of bilingual terminology, that is, the non-compound characteristics of the phrases, using Pitman-Yor process-based smoothing methods. Using the 200k JP-EN NTCIR corpus, our experimental results show that the overall improvement of this method was 1.33 BLEU points absolute and 6.1% relative.