The Word Is Mightier than the Count: Accumulating Translation Resources from Parsed Parallel Corpora
Abstract
Large, high-quality, sentence-aligned parallel corpora are hard to come by, which makes the Statistical Machine Translation enterprise more difficult. Even noisy corpora, though, can provide useful translation resources that are not otherwise available. Many investigations have used statistical methods to find word correspondences. Such methods often suffer from overgeneration, so to correct this we filter relevant translation candidates using a lexical post-process. This dictionary lookup is in fact so effective that it calls into question the value of the statistical methods. Using a dictionary lookup against all combinations of phrase pairs as a baseline, we compare three statistical methods and report the results. The three methods are (1) Mutual Information; (2) Expectation Maximization (EM) over word co-occurrence frequencies; and (3) EM over word alignments in every sentence. We also apply the dictionary lookup as a post-process to tackle overgeneration.
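To make the first method and the dictionary post-process concrete, here is a minimal sketch in Python. It is not the authors' implementation: the abstract does not give their exact formulation, so this assumes pointwise mutual information computed from per-sentence co-occurrence counts, and the function names, the romanized toy corpus, and the toy dictionary are all hypothetical.

```python
import math
from collections import Counter
from itertools import product


def pmi_candidates(sentence_pairs, min_pmi=0.0):
    """Score source/target word pairs by pointwise mutual information.

    Assumption: each word (and each word pair) is counted at most once
    per aligned sentence pair.
    """
    n = len(sentence_pairs)
    src_count, tgt_count, pair_count = Counter(), Counter(), Counter()
    for src, tgt in sentence_pairs:
        src_set, tgt_set = set(src), set(tgt)
        src_count.update(src_set)
        tgt_count.update(tgt_set)
        pair_count.update(product(src_set, tgt_set))

    scores = {}
    for (s, t), c in pair_count.items():
        # PMI = log2( P(s, t) / (P(s) * P(t)) ), with probabilities
        # estimated as counts divided by the number of sentence pairs.
        pmi = math.log2(c * n / (src_count[s] * tgt_count[t]))
        if pmi > min_pmi:
            scores[(s, t)] = pmi
    return scores


def dictionary_filter(scores, dictionary):
    """Lexical post-process: keep only pairs the dictionary licenses."""
    return {
        (s, t): v
        for (s, t), v in scores.items()
        if t in dictionary.get(s, ())
    }


# Hypothetical toy data: romanized Japanese / English sentence pairs.
corpus = [
    (["inu", "ga", "hashiru"], ["the", "dog", "runs"]),
    (["neko", "ga", "hashiru"], ["the", "cat", "runs"]),
    (["inu", "ga", "neru"], ["the", "dog", "sleeps"]),
]
toy_dictionary = {
    "inu": {"dog"},
    "neko": {"cat"},
    "hashiru": {"run", "runs"},
    "neru": {"sleep", "sleeps"},
}

scores = pmi_candidates(corpus)
filtered = dictionary_filter(scores, toy_dictionary)
```

On this toy corpus the unfiltered candidates include spurious pairs such as ("inu", "sleeps"), which is the overgeneration the abstract describes; the dictionary filter removes them and keeps only the licensed pairs.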
Author information
Authors and Affiliations
- Spoken Language Translation Research Laboratory, ATR, Keihanna, Kyoto, Japan
Stephen Nightingale & Hideki Tanaka
Editor information
Editors and Affiliations
- Centro de Investigación en Computación (CIC), Instituto Politécnico Nacional (IPN), Col. Zacatenco, CP 07738, Mexico D.F., Mexico
Alexander Gelbukh
Copyright information
© 2003 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Nightingale, S., Tanaka, H. (2003). The Word Is Mightier than the Count: Accumulating Translation Resources from Parsed Parallel Corpora. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2003. Lecture Notes in Computer Science, vol 2588. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-36456-0_45
- DOI: https://doi.org/10.1007/3-540-36456-0_45
- Published: 30 April 2003
- Publisher Name: Springer, Berlin, Heidelberg
- Print ISBN: 978-3-540-00532-2
- Online ISBN: 978-3-540-36456-6
- eBook Packages: Springer Book Archive