Generation in machine translation from deep syntactic trees

Improve syntax-based translation using deep syntactic structures

Machine Translation, 2010

This paper introduces deep syntactic structures to syntax-based Statistical Machine Translation (SMT). We use a Head-driven Phrase Structure Grammar (HPSG) parser to obtain the deep syntactic structures of a sentence, which include not only a fine-grained description of syntactic properties but also a semantic representation. Given the rich information contained in deep syntactic structures, it is natural to investigate whether they can improve on traditional syntax-based translation models built on PCFG parsers. To use deep syntactic structures for SMT, this paper focuses on extracting tree-to-string translation rules from aligned HPSG tree-string pairs. The major challenge is to properly localize the non-local relations among nodes in an HPSG tree. To localize the semantic dependencies among words and phrases, which can be inherently non-local, a minimum covering tree is defined by taking a predicate word and its lexical/phrasal arguments as the frontier nodes. Starting from this definition, a linear-time algorithm is proposed that extracts translation rules in a single traversal of the leaf nodes of an HPSG tree. Extensive experiments on a tree-to-string translation system confirm the effectiveness of our proposal.
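
The minimum covering tree construction lends itself to a compact illustration. Below is a minimal sketch (not the paper's code) of the core step: given a predicate node and its argument nodes in a parse tree, the covering tree is rooted at their lowest common ancestor, with those nodes as its frontier. The `Node` class and helper functions are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    label: str
    parent: "Node | None" = None
    children: list = field(default_factory=list)

def path_to_root(node):
    """Return [node, node.parent, ..., root]."""
    path = []
    while node is not None:
        path.append(node)
        node = node.parent
    return path

def minimum_covering_root(frontier_nodes):
    """Lowest common ancestor of a predicate and its arguments: the root
    of the minimum covering tree whose frontier is those nodes."""
    ancestor_ids = [set(id(a) for a in path_to_root(n)) for n in frontier_nodes]
    shared = set.intersection(*ancestor_ids)
    cur = frontier_nodes[0]
    while id(cur) not in shared:   # first ancestor shared by all nodes
        cur = cur.parent
    return cur

# Toy tree: S -> NP(arg) VP(V(pred) NP(arg)); the covering root is S.
s = Node("S"); np1 = Node("NP", parent=s); vp = Node("VP", parent=s)
v = Node("V", parent=vp); np2 = Node("NP", parent=vp)
print(minimum_covering_root([np1, v, np2]).label)  # -> S
```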

Data-driven sentence generation with non-isomorphic trees

Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2015

Abstract structures from which generation naturally starts often do not contain any functional nodes, while surface-syntactic structures or a chain of tokens in a linearized tree contain all of them. Data-driven linguistic generation therefore needs to cope with the projection between non-isomorphic structures that differ in their topology and number of nodes. So far, such a projection has been a challenge in data-driven generation and has largely been avoided. We present a fully stochastic generator that is able to cope with projection between non-isomorphic structures. The generator, which starts from PropBank-like structures, consists of a cascade of SVM-classifier-based submodules that map the input structures onto sentences in a series of transitions. The generator has been evaluated for English on the Penn Treebank and for Spanish on the multi-layered AnCora-UPF corpus.
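
As a rough illustration of the architecture (a cascade of classifier-based submodules, each performing one mapping step), here is a minimal sketch using scikit-learn. The features and transition labels are invented placeholders, not the paper's actual feature set.

```python
from sklearn.svm import LinearSVC
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline

def make_submodule():
    # One stage of the cascade: an SVM that predicts the next transition
    # (e.g. introduce a functional node) from local feature dicts.
    return make_pipeline(DictVectorizer(), LinearSVC())

# Hypothetical training data for one stage.
X = [{"lemma": "sleep", "pos": "VB", "has_subject": True},
     {"lemma": "cat", "pos": "NN", "has_subject": False}]
y = ["INTRODUCE_AUX", "NO_OP"]

stage = make_submodule()
stage.fit(X, y)
print(stage.predict([{"lemma": "run", "pos": "VB", "has_subject": True}]))
```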

Recursive Top-Down Production for Sentence Generation with Latent Trees

Findings of the Association for Computational Linguistics: EMNLP 2020, 2020

We model the recursive production property of context-free grammars for natural and synthetic languages. To this end, we present a dynamic programming algorithm that marginalises over latent binary tree structures with N leaves, allowing us to compute the likelihood of a sequence of N tokens under a latent tree model, which we maximise to train a recursive neural function. We demonstrate performance on two synthetic tasks: SCAN (Lake and Baroni, 2017), where it outperforms previous models on the LENGTH split, and English question formation (McCoy et al., 2020), where it performs comparably to decoders with the ground-truth tree structure. We also present experimental results on German-English translation on the Multi30k dataset (Elliott et al., 2016), and qualitatively analyse the induced tree structures our model learns for the SCAN tasks and the German-English translation task.
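
The marginalisation the abstract describes is the classic inside algorithm over all binary bracketings. The sketch below implements it with a hypothetical `span_score` standing in for the learned composition function.

```python
import numpy as np
from scipy.special import logsumexp

def inside_log_likelihood(leaf_logp, span_score):
    """Sum over all latent binary trees of an N-token sequence in O(N^3).
    leaf_logp[i] is the log-probability of token i; span_score(i, k, j)
    is the log-score of composing spans [i, k) and [k, j)."""
    n = len(leaf_logp)
    chart = np.full((n, n + 1), -np.inf)
    for i in range(n):
        chart[i, i + 1] = leaf_logp[i]
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            chart[i, j] = logsumexp([chart[i, k] + chart[k, j] + span_score(i, k, j)
                                     for k in range(i + 1, j)])
    return chart[0, n]

# Toy check: uniform composition scores over a 3-token sequence.
print(inside_log_likelihood(np.log([0.1, 0.2, 0.3]), lambda i, k, j: 0.0))
```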

Modeling Source Syntax for Neural Machine Translation

Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2017

Even though a linguistics-free sequence-to-sequence model in neural machine translation (NMT) has some capacity to implicitly learn syntactic information about source sentences, this paper shows that source syntax can be explicitly and effectively incorporated into NMT to provide further improvements. Specifically, we linearize the parse trees of source sentences to obtain structural label sequences. On this basis, we propose three different encoders to incorporate source syntax into NMT: 1) a Parallel RNN encoder that learns word and label annotation vectors in parallel; 2) a Hierarchical RNN encoder that learns word and label annotation vectors in a two-level hierarchy; and 3) a Mixed RNN encoder that learns word and label annotation vectors over sequences in which words and labels are interleaved. Experiments on Chinese-to-English translation demonstrate that all three proposed syntactic encoders improve translation accuracy. Interestingly, the simplest encoder, the Mixed RNN encoder, yields the best performance, with a significant improvement of 1.4 BLEU points. Moreover, an in-depth analysis from several perspectives reveals how source syntax benefits NMT.
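
The Mixed RNN encoder's input can be pictured with a simple linearization: the encoder consumes one sequence in which syntactic labels and words are interleaved. The nested-tuple tree format below is a simplifying assumption, not the paper's data structure.

```python
def linearize_mixed(tree):
    """tree: (label, children), where each child is a subtree or a word."""
    label, children = tree
    out = [label]
    for child in children:
        if isinstance(child, str):
            out.append(child)
        else:
            out.extend(linearize_mixed(child))
    return out

tree = ("S", [("NP", ["the", "cat"]), ("VP", ["sleeps"])])
print(linearize_mixed(tree))
# ['S', 'NP', 'the', 'cat', 'VP', 'sleeps']
```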

Explicit Syntactic Guidance for Neural Text Generation

arXiv preprint, 2023

Most existing text generation models follow the sequence-to-sequence paradigm. Generative Grammar suggests that humans generate natural language texts by learning language grammar. We propose a syntax-guided generation schema, which generates a sequence guided by a constituency parse tree in a top-down direction. The decoding process can be decomposed into two parts: (1) predicting the infilling texts for each constituent in the lexicalized syntax context given the source sentence; and (2) mapping and expanding each constituent to construct the next-level syntax context. Accordingly, we propose a structural beam search method to find possible syntax structures hierarchically. Experiments on paraphrase generation and machine translation show that the proposed method outperforms autoregressive baselines, while also demonstrating effectiveness in terms of interpretability, controllability, and diversity.
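
A toy sketch of structural beam search under stated assumptions: each beam item carries a frontier of unexpanded constituents, a hypothetical `propose` function scores candidate expansions, and the beam keeps the best next-level syntax contexts. This is a schematic reading of the abstract, not the authors' implementation.

```python
import heapq

def structural_beam_search(root, propose, beam_size=4, max_steps=16):
    """propose(constituent) -> list of (log_score, child_constituents)."""
    beam = [(0.0, [root])]
    for _ in range(max_steps):
        candidates = []
        for score, frontier in beam:
            if not frontier:                 # fully expanded hypothesis
                candidates.append((score, frontier))
                continue
            head, rest = frontier[0], frontier[1:]
            for s, children in propose(head):
                candidates.append((score + s, children + rest))
        beam = heapq.nlargest(beam_size, candidates, key=lambda c: c[0])
    return beam
```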

Towards String to Tree Neural Machine Translation

Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 2017

We present a simple method to incorporate syntactic information about the target language in a neural machine translation system by translating into linearized, lexicalized constituency trees. An experiment on the WMT16 German-English news translation task resulted in an improved BLEU score when compared to a syntax-agnostic NMT baseline trained on the same dataset. An analysis of the translations from the syntax-aware system shows that it performs more reordering during translation in comparison to the baseline. A small-scale human evaluation also showed an advantage to the syntax-aware system.
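
The target-side representation is easy to picture: the decoder emits a bracketed token sequence that deterministically encodes a lexicalized constituency tree. Below is a minimal sketch of the serialization, with a nested-tuple tree as a simplifying assumption.

```python
def linearize(tree):
    """Serialize (label, children) into the bracketed tokens the decoder emits."""
    label, children = tree
    tokens = ["(" + label]
    for child in children:
        tokens.extend([child] if isinstance(child, str) else linearize(child))
    tokens.append(")")
    return tokens

tree = ("S", [("NP", ["the", "house"]), ("VP", ["is", ("ADJP", ["blue"])])])
print(" ".join(linearize(tree)))
# (S (NP the house ) (VP is (ADJP blue ) ) )
```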

Deep syntax language models and statistical machine translation

Yvette Graham and Josef van Genabith. Proceedings of the 4th Workshop on Syntax and Structure in Statistical Translation (SSST-4) at COLING 2010, Beijing, China, 2010

Hierarchical models increase the reordering capabilities of MT systems by introducing non-terminal symbols into phrases that map source language (SL) words/phrases to the correct position in the target language (TL) translation. Building translations via discontiguous TL phrases increases the difficulty of language modeling, however, introducing the need for heuristic techniques such as cube pruning (Chiang, 2005). An additional way to aid language modeling in hierarchical systems is to use a language model that models the fluency of a word not via its local context in the string, as in traditional language models, but via its deeper syntactic context. In this paper, we explore the potential of deep syntax language models, providing an interesting comparison with the traditional string-based language model. We include an experimental evaluation that compares the two kinds of models independently of any MT system, to investigate the potential of integrating a deep syntax language model into hierarchical SMT systems.
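
To make the contrast concrete, here is a toy head-conditioned bigram model: each word is scored given its governor in the deep-syntactic structure rather than its string predecessors. Count-based estimation with add-alpha smoothing is a stand-in for the paper's actual model.

```python
from collections import Counter

class HeadBigramLM:
    """P(word | head) estimated from (word, head_word) pairs."""
    def __init__(self):
        self.pair = Counter()    # (head, word) counts
        self.head = Counter()    # head counts

    def train(self, trees):
        # Each tree is a list of (word, head_word) pairs; the root's head is "ROOT".
        for tree in trees:
            for word, head in tree:
                self.pair[(head, word)] += 1
                self.head[head] += 1

    def prob(self, word, head, alpha=1.0, vocab_size=10_000):
        # Add-alpha smoothing; vocab_size is an assumed constant.
        return (self.pair[(head, word)] + alpha) / (self.head[head] + alpha * vocab_size)

lm = HeadBigramLM()
lm.train([[("sleeps", "ROOT"), ("cat", "sleeps"), ("the", "cat")]])
print(lm.prob("cat", "sleeps"))
```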

Improving Language Generation from Feature-Rich Tree-Structured Data with Relational Graph Convolutional Encoders

Proceedings of the 2nd Workshop on Multilingual Surface Realisation (MSR 2019)

The Multilingual Surface Realization Shared Task 2019 focuses on generating sentences from lemmatized Universal Dependencies parses with rich features. This paper describes the system design and the results of our participation in the deep track. The core innovation in our approach is the use of a graph convolutional network to encode the dependency trees given as input. With the addition of morphological features, our system ranks second in the deep track without using data augmentation techniques or additional components (such as a re-ranker).
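
The encoder follows the standard relational GCN idea: one weight matrix per dependency relation, plus a self-loop term. The sketch below shows a single layer in NumPy; shapes and normalisation follow the usual R-GCN recipe rather than the authors' exact code.

```python
import numpy as np

def rgcn_layer(H, edges, W_rel, W_self):
    """One relational graph convolution over a dependency tree.
    H: (num_nodes, d) node features; edges: list of (src, dst, rel);
    W_rel[rel] and W_self: (d, d) weight matrices."""
    out = H @ W_self                        # self-loop term
    in_degree = np.zeros(len(H))
    for _, dst, _ in edges:
        in_degree[dst] += 1
    for src, dst, rel in edges:             # relation-specific messages
        out[dst] += (H[src] @ W_rel[rel]) / max(in_degree[dst], 1)
    return np.maximum(out, 0)               # ReLU
```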

The UOT system: improve string-to-tree translation using head-driven phrase structure grammar and predicate-argument structures

Proceedings of the International Workshop on Spoken Language Translation (IWSLT), 2009

We present the UOT Machine Translation System that was used in the IWSLT-09 evaluation campaign. This year, we participated in the BTEC track for Chinese-to-English translation. Our system is based on a string-to-tree framework. To integrate deep syntactic information, we propose using parse trees and semantic dependencies of English sentences, described by Head-driven Phrase Structure Grammar and Predicate-Argument Structures respectively. We report the results of our system on both the development and test sets.

Automatically generated parallel treebanks and their exploitability in machine translation

Machine Translation, 2009

Given much recent discussion and the shift in focus of the field, it is becoming apparent that the incorporation of syntax is the way forward for improving the current state-of-the-art in machine translation (MT). Parallel treebanks are a relatively recent innovation and appear to be ideal candidates for MT training material. However, until recently there has been no means of building them other than by hand. In this paper, we describe how we use new tools to automatically build a large parallel treebank and extract a set of linguistically motivated phrase pairs from it. We show that adding these phrase pairs to the translation model of a baseline phrase-based statistical MT (PB-SMT) system leads to significant improvements in translation quality. We then describe experiments in which we exploit the information encoded in the parallel treebank in other areas of the PB-SMT framework, while investigating the conditions under which the incorporation of parallel treebank data performs best. Finally, we discuss the possibility of further exploiting automatically generated parallel treebanks in syntax-aware paradigms of MT.
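
A simple way to picture the phrase-pair extraction: every pair of aligned constituent nodes in the parallel treebank contributes the pair of its surface yields. The tree and alignment formats below are simplifying assumptions, not the paper's tools.

```python
def surface_yield(tree):
    """Collect the words under a (label, children) subtree."""
    _, children = tree
    words = []
    for child in children:
        words.extend([child] if isinstance(child, str) else surface_yield(child))
    return words

def extract_phrase_pairs(node_alignments):
    """node_alignments: list of (source_subtree, target_subtree) pairs."""
    return [(" ".join(surface_yield(s)), " ".join(surface_yield(t)))
            for s, t in node_alignments]
```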