Morphological segmentation method for Turkic language neural machine translation
Related papers
2022
Morphologically-rich polysynthetic languages present a challenge for NLP systems due to data sparsity, and a common strategy to handle this issue is to apply subword segmentation. We investigate a wide variety of supervised and unsupervised morphological segmentation methods for four polysynthetic languages: Nahuatl, Raramuri, Shipibo-Konibo, and Wixarika. Then, we compare the morphologically inspired segmentation methods against Byte-Pair Encodings (BPEs) as inputs for machine translation (MT) when translating to and from Spanish. We show that for all language pairs except for Nahuatl, an unsupervised morphological segmentation algorithm outperforms BPEs consistently and that, although supervised methods achieve better segmentation scores, they under-perform in MT challenges. Finally, we contribute two new morphological segmentation datasets for Raramuri and Shipibo-Konibo, and a parallel corpus for Raramuri-Spanish.
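For readers unfamiliar with the BPE baseline named in this abstract, the following is a minimal sketch of the standard merge-learning procedure, not the paper's implementation. The function names (`learn_bpe`, `merge_pair`, `apply_bpe`), the `</w>` end-of-word marker, and the toy word list are assumptions made for illustration.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge operations from a list of word tokens."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Pick the most frequent pair; ties are broken by the pair itself (assumption).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def apply_bpe(word, merges):
    """Segment a single word by replaying the learned merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Toy corpus (hypothetical strings, not data from the paper).
toy_words = ["abab", "abab", "ab"]
toy_merges = learn_bpe(toy_words, num_merges=2)
toy_segments = apply_bpe("abab", toy_merges)
```

Because the merges are chosen purely by frequency, the resulting units need not coincide with morphemes, which is the contrast the abstract draws with morphologically inspired segmentation.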
Hybrid Morphological Segmentation for Phrase-Based Machine Translation
Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers, 2016
This article describes the Aalto University entry to the English-to-Finnish news translation shared task in WMT 2016. Our segmentation method combines the strengths of rule-based and unsupervised morphology. We also attempt to correct errors in the boundary markings by post-processing with a neural morph boundary predictor.
2014
We present a novel segmentation approach for Phrase-Based Statistical Machine Translation (PB-SMT) for languages where word boundaries are not obviously marked, using both monolingual and bilingual information. We demonstrate that (1) an unsegmented corpus can yield nearly the same results as a manually segmented corpus in the PB-SMT task when a good heuristic character clustering algorithm is applied to it, and (2) PB-SMT performance increases significantly when bilingual information is used on top of the monolingual segmentation result. Our technique, instead of focusing on word separation, mainly concentrates on groups of characters. First, we group several characters that reside in an unsegmented corpus by employing predetermined constraints and certain heuristic algorithms. Second, we enhance the segmented result by incorporating character group repacking based on alignment confidence. We evaluate the effectiveness of our method on PB-SMT task using En...
Neural machine translation system for the Kazakh language based on synthetic corpora
MATEC Web of Conferences
Large parallel corpora are lacking for the Kazakh language, which seriously impairs the quality of machine translation from and into Kazakh. This article considers neural machine translation of Kazakh on the basis of synthetic corpora. Kazakh belongs to the Turkic languages, which are characterised by rich morphology, and neural machine translation of natural languages requires large training data. The article presents a model for creating synthetic corpora, namely the generation of sentences based on the complete system of suffixes of the Kazakh language, which is the novelty of this approach. Using the generated synthetic corpora improves translation quality in neural machine translation for the Kazakh-English and Kazakh-Russian pairs.
Development of Morphological Segmentation for the Kyrgyz Language on Complete Set of Endings
Intelligent Information and Database Systems, 2021
The Old Turkic language is the basis of all modern Turkic languages. Its study is very important for the Turkic peoples who speak modern Turkic languages, both from a historical point of view and for current questions of neural machine translation and of the linguistic distance between modern Turkic languages and their progenitor. This paper proposes a computational model of the morphology of Old Turkic based on the CSE (Complete Set of Endings) model of morphology and, on this basis, studies morphological segmentation of Old Turkic texts, which will subsequently be used for neural machine translation of Old Turkic into modern Turkic languages. Since most modern Turkic languages, except Turkish, are low-resource languages, developing computational models of morphology, algorithms, and software for processing Turkic languages remains relevant.
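As a rough illustration of segmentation driven by a table of complete endings, the sketch below splits a word into a stem and the longest ending found in the table. It is an assumption-laden toy, not the CSE model from the paper: the function name `segment_by_endings`, the `min_stem_len` parameter, and the small ending table are hypothetical, and the endings are hand-picked modern Turkish forms used only because they are easy to verify, not Old Turkic or Kyrgyz data.

```python
def segment_by_endings(word, endings, min_stem_len=2):
    """Split `word` into (stem, ending) using the longest matching ending.

    Returns (word, "") when no ending in the table matches or when the
    remaining stem would be shorter than `min_stem_len` characters.
    """
    # Try longer endings first so a complete ending wins over its own suffixes.
    for ending in sorted(endings, key=len, reverse=True):
        stem_len = len(word) - len(ending)
        if word.endswith(ending) and stem_len >= min_stem_len:
            return word[:stem_len], ending
    return word, ""


# Hypothetical toy ending table (Turkish forms, for illustration only).
TOY_ENDINGS = {"ler", "lar", "de", "da", "lerimiz", "lerimizde"}

split_long = segment_by_endings("evlerimizde", TOY_ENDINGS)
split_plural = segment_by_endings("kitaplar", TOY_ENDINGS)
split_bare = segment_by_endings("ev", TOY_ENDINGS)
```

Treating the whole chain of suffixes as a single ending keeps the lookup to one table match per word; a real system would also need to handle ambiguity, for example stems that happen to end in a string listed as an ending.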
The neural machine translation models for the low-resource Kazakh–English language pair
PeerJ Computer Science
The development of the machine translation field was driven by people’s need to communicate with each other globally by automatically translating words, sentences, and texts from one language into another. The neural machine translation approach has become one of the most significant in recent years. This approach requires large parallel corpora not available for low-resource languages, such as the Kazakh language, which makes it difficult to achieve the high performance of the neural machine translation models. This article explores the existing methods for dealing with low-resource languages by artificially increasing the size of the corpora and improving the performance of the Kazakh–English machine translation models. These methods are called forward translation, backward translation, and transfer learning. Then the Sequence-to-Sequence (recurrent neural network and bidirectional recurrent neural network) and Transformer neural machine translation architectures with their featur...
Morphology-aware Word-Segmentation in Dialectal Arabic Adaptation of Neural Machine Translation
Proceedings of the Fourth Arabic Natural Language Processing Workshop, 2019
Parallel corpora available for building machine translation (MT) models for dialectal Arabic (DA) are rather limited. The scarcity of resources has prompted the use of Modern Standard Arabic (MSA) abundant resources to complement the limited dialectal resource. However, clitics often differ between MSA and DA. This paper compares morphology-aware DA word segmentation to other word segmentation approaches like Byte Pair Encoding (BPE) and Sub-word Regularization (SR). A set of experiments conducted on Egyptian Arabic (EA), Levantine Arabic (LA), and Gulf Arabic (GA) show that a sufficiently accurate morphology-aware segmentation used in conjunction with BPE or SR outperforms the other word segmentation approaches.
Proceedings of the Fifth Workshop on South and Southeast Asian Natural Language Processing, 2014
We present a novel segmentation approach for Phrase-Based Statistical Machine Translation (PB-SMT) for languages where word boundaries are not obviously marked, using both monolingual and bilingual information. We demonstrate that (1) an unsegmented corpus can yield nearly the same results as a manually segmented corpus in the PB-SMT task when a good heuristic character clustering algorithm is applied to it, and (2) PB-SMT performance increases significantly when bilingual information is used on top of the monolingual segmentation result. Our technique, instead of focusing on word separation, mainly concentrates on character clustering. First, we cluster each character from the unsegmented monolingual corpus by employing character co-occurrence statistics and orthographic insight. Second, we enhance the segmented result by incorporating bilingual information, namely character cluster alignment, co-occurrence frequency, and alignment confidence. We evaluate the effectiveness of our method on the PB-SMT task using the English-Thai language pair and report a best improvement of an 8.1% increase in BLEU score. There are two main advantages of our approach. First, our method requires less effort to develop the corpus and can be applied to an unsegmented corpus or a poor-quality manually segmented corpus. Second, the technique is not limited to a specific language pair and can automatically adjust the character cluster boundaries to suit other language pairs.
Minimally Supervised Morphological Segmentation with Applications to Machine Translation
2006
Inflected languages in a low-resource setting present a data sparsity problem for statistical machine translation. In this paper, we present a minimally supervised algorithm for morpheme segmentation on Arabic dialects which reduces unknown words at translation time by over 50%, total vocabulary size by over 40%, and yields a significant increase in BLEU score over a previous state-of-the-art phrase-based statistical MT system.
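The unknown-word and vocabulary figures quoted in this abstract are relative reductions. A minimal sketch of how such numbers can be computed follows; the helper names (`oov_rate`, `relative_reduction`) and the tiny token lists are assumptions for illustration (hand-segmented Turkish toy words, not the paper's Arabic data), so the resulting values say nothing about the reported results.

```python
def oov_rate(train_tokens, test_tokens):
    """Fraction of test tokens that never occur in the training tokens."""
    known = set(train_tokens)
    if not test_tokens:
        return 0.0
    unknown = sum(1 for tok in test_tokens if tok not in known)
    return unknown / len(test_tokens)


def relative_reduction(before, after):
    """Relative drop from `before` to `after`, e.g. 0.5 means a 50% reduction."""
    return (before - after) / before


# Toy data: whole words versus the same words segmented by hand (hypothetical).
train_words = ["evler", "kitapta"]
test_words = ["kitaplar", "evde"]
train_morphs = ["ev", "ler", "kitap", "ta"]
test_morphs = ["kitap", "lar", "ev", "de"]

word_oov = oov_rate(train_words, test_words)      # every test word is unseen
morph_oov = oov_rate(train_morphs, test_morphs)   # stems are now seen in training
oov_drop = relative_reduction(word_oov, morph_oov)
```

Note that segmentation changes the number of test tokens, so token-level OOV rates before and after are measured over different denominators; reports should state which unit is being counted.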
Linguistically Motivated Unsupervised Segmentation for Machine Translation
In this paper we use statistical machine translation and morphology information from two different morphological analyzers to try to improve translation quality through linguistically motivated segmentation. The morphological analyzers we use are the unsupervised Morfessor morpheme segmentation and analysis toolkit and the rule-based morphological analyzer T3. Our translations are done using the Moses statistical machine translation toolkit, trained on the JRC-Acquis corpora, in the Estonian-to-English and English-to-Estonian directions. In our work we model linguistic phenomena such as word lemmas and endings, and we split compound words into simpler parts. Lemma information was also used to introduce new factors to the corpora and to use this information for better word alignment or for alternative path back-off translation. From the results we find that even though these methods have shown previously and keep showing promise of improved translation, their succ...