Yuqing Guo - Profile on Academia.edu
Papers by Yuqing Guo
An important element in question answering systems is the analysis and interpretation of questions. Using the NTCIR 5 Cross-Language Question Answering (CLQA) question test set, we demonstrate that the accuracy of deep question analysis is dependent on the quantity and suitability of the available linguistic resources. We further demonstrate that applying question analysis tools developed on monolingual training materials to questions translated Chinese-English and English-Chinese using machine translation produces much reduced effectiveness in interpretation of the question. This latter result indicates that question analysis for CLQA should primarily be conducted in the question language prior to translation.
To date, work on Non-Local Dependencies (NLDs) has focused almost exclusively on English, and it is an open research question how well these approaches migrate to other languages. This paper surveys non-local dependency constructions in Chinese as represented in the Penn Chinese Treebank (CTB) and provides an approach for generating proper predicate-argument-modifier structures including NLDs from surface context-free phrase structure trees. Our approach recovers non-local dependencies at the level of Lexical-Functional Grammar f-structures, using automatically acquired subcategorisation frames and f-structure paths linking antecedents and traces in NLDs. Currently our algorithm achieves 92.2% f-score for trace insertion and 84.3% for antecedent recovery evaluating on gold-standard CTB trees, and 64.7% and 54.7%, respectively, on CTB-trained state-of-the-art parser output trees.
Guo, Yuqing. Treebank-Based Acquisition of Chinese LFG Resources for Parsing and Generation. PhD Thesis, Dublin City University, Nov 1, 2009
This thesis describes a treebank-based approach to automatically acquire robust, wide-coverage Lexical-Functional Grammar (LFG) resources for Chinese parsing and generation, which is part of a larger project on the rapid construction of deep, large-scale, constraint-based, multilingual grammatical resources. I present an application-oriented LFG analysis for Chinese core linguistic phenomena and (in cooperation with PARC) develop a gold-standard dependency-bank of Chinese f-structures for evaluation. Based on the Penn Chinese Treebank, I design and implement two architectures for inducing Chinese LFG resources, one annotation-based and the other dependency conversion-based. I then apply the f-structure acquisition algorithm together with external, state-of-the-art parsers to parsing new text into "proto" f-structures. In order to convert "proto" f-structures into "proper" f-structures or deep dependencies, I present a novel Non-Local Dependency (NLD) recovery algorithm using subcategorisation frames and f-structure paths linking antecedents and traces in NLDs extracted from the automatically-built LFG f-structure treebank. Based on the grammars extracted from the f-structure-annotated treebank, I develop a PCFG-based chart generator and a new n-gram-based pure dependency generator to realise Chinese sentences from LFG f-structures. The work reported in this thesis is the first effort to scale treebank-based, probabilistic Chinese LFG resources from proof-of-concept research to unrestricted, real text. Although this thesis concentrates on Chinese and LFG, many of the methodologies, e.g. the acquisition of predicate-argument structures, NLD resolution and the PCFG- and dependency n-gram-based generation models, are largely language and formalism independent and should generalise to diverse languages as well as to labelled bilexical dependency representations other than LFG.
Proceedings of the 13th European Workshop on Natural Language Generation, Sep 28, 2011
In this paper we describe our system and experimental results on the development set of the Surface Realisation Shared Task. DCU submitted 1-best outputs for the Shallow subtask of the shared task, using a surface realisation technique based on dependency-based n-gram models. The surface realiser achieved BLEU and NIST scores of 0.8615 and 13.6841 respectively on the SR development set. * Throughout this document DCU stands for the joint team of Dublin City University and Toshiba (China) Research and Development Center participating in the SR Task 2011.
China-Ireland International Conference on Information and Communications Technologies (CIICT 2007), 2007
Question-answering (QA) is a next-generation search technology which aims to provide answers to a user's question from a collection of documents. Cross-Language QA (CLQA) extends this paradigm to answering questions from a collection in a different language to the question itself. The accuracy with which a CLQA system answers questions depends on the QA system and translation between the question and the information source. We report results from an evaluation of English-Chinese CLQA comparing question translation using standard machine translation systems and extended translation incorporating web mining to enhance the translation dictionary against a baseline of monolingual Chinese QA. Results from these experiments show that our noun phrase recognition and translation techniques lead to a significant improvement in CLQA effectiveness. Moreover, the syntactic form of a question can be impaired during query translation, which can in turn degrade overall CLQA system performance.
Proceedings of the Fifth International Natural Language Generation Conference (INLG '08), 2008
We describe three PCFG-based models for Chinese sentence realisation from Lexical-Functional Grammar (LFG) f-structures. Both the lexicalised model and the history-based model improve on the accuracy of a simple wide-coverage PCFG model by adding lexical and contextual information to weaken inappropriate independence assumptions implicit in the PCFG models. In addition, we provide techniques for lexical smoothing and rule smoothing to increase the generation coverage. Trained on 15,663 automatically LFG f-structure-annotated sentences of the Penn Chinese Treebank and tested on 500 sentences randomly selected from the treebank test set, the lexicalised model achieves a BLEU score of 0.7265 at 100% coverage, while the history-based model achieves a BLEU score of 0.7245 also at 100% coverage.
Method and apparatus for training transliteration model and parsing statistic model, method and apparatus for transliteration
Study on semantic paragraph partition in automatic abstracting system
2001 IEEE International Conference on Systems, Man and Cybernetics. e-Systems and e-Man for Cybernetics in Cyberspace (Cat.No.01CH37236), 2001
Semantic paragraph partition is an important problem in text structure analysis in an automatic abstracting system. For an article containing distinct headings, the paper presents heading models in Chinese text to divide an article into semantic paragraphs based on the recognition of headings. For an article not containing headings, the paper establishes a vector space model for the whole article
This paper describes log-linear models for a general-purpose sentence realizer based on dependency structures. Unlike traditional realizers using grammar rules, our method realizes sentences by linearizing dependency relations directly in two steps. First, the relative order between head and each dependent is determined by their dependency relation. Then the best linearizations compatible with the relative order are selected by log-linear models. The log-linear models incorporate three types of feature functions, including dependency relations, surface words and headwords. Our approach to sentence realization provides simplicity, efficiency and competitive accuracy. Trained on 8,975 dependency structures of a Chinese Dependency Treebank, the realizer achieves a BLEU score of 0.8874.
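The two-step procedure described in this abstract can be illustrated with a minimal Python sketch. Everything concrete in it is an assumption made for illustration: the relation labels, the `SIDE` table that places each dependent before or after its head, the hand-set `WEIGHTS`, and the function names are hypothetical and are not taken from the paper. Because a log-linear model assigns a probability proportional to the exponential of a weighted feature sum, choosing the best order only requires comparing the sums.

```python
import itertools

# Step 1 (assumed rule table): the dependency relation decides whether a
# dependent is placed before ("pre") or after ("post") its head.
SIDE = {"subj": "pre", "adv": "pre", "obj": "post"}

# Step 2 (assumed, hand-set weights): a toy log-linear model over
# relation-bigram and word-bigram features. Real weights would be trained.
WEIGHTS = {
    ("rel", "subj", "adv"): 1.0,
    ("rel", "adv", "subj"): -0.5,
    ("word", "she", "often"): 0.3,
}


def features(seq):
    """Return bigram features over a sequence of (word, relation) pairs."""
    feats = []
    for (w1, r1), (w2, r2) in zip(seq, seq[1:]):
        feats.append(("rel", r1, r2))
        feats.append(("word", w1, w2))
    return feats


def score(seq):
    """Weighted feature sum; exp(score) is the unnormalised probability."""
    return sum(WEIGHTS.get(f, 0.0) for f in features(seq))


def linearize(head, dependents):
    """Order a head word and its (word, relation) dependents."""
    # Step 1: split dependents by the side their relation assigns them to.
    pre = [d for d in dependents if SIDE[d[1]] == "pre"]
    post = [d for d in dependents if SIDE[d[1]] == "post"]
    # Step 2: among orders compatible with step 1, keep the best-scoring one.
    best_pre = max(itertools.permutations(pre), key=score)
    best_post = max(itertools.permutations(post), key=score)
    return [w for w, _ in best_pre] + [head] + [w for w, _ in best_post]


print(linearize("likes", [("apples", "obj"), ("often", "adv"), ("she", "subj")]))
# prints ['she', 'often', 'likes', 'apples']
```

The sketch enumerates every permutation on each side of the head, which is only practical for the handful of dependents a single head usually has; it says nothing about how the paper's realizer searches or which features it actually uses beyond the three types named in the abstract.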
This paper presents a method to automatically acquire wide-coverage, robust, probabilistic Lexical-Functional Grammar resources for Chinese from the Penn Chinese Treebank (CTB). Our starting point is the earlier, proof-of-concept work of Burke et al. (2004) on automatic f-structure annotation, LFG grammar acquisition and parsing for Chinese using the CTB version 2 (CTB2). We substantially extend and improve on this earlier research as regards coverage, robustness, quality and fine-grainedness of the resulting LFG resources. We achieve this through (i) improved LFG analyses for a number of core Chinese phenomena; (ii) a new automatic f-structure annotation architecture which involves an intermediate dependency representation; (iii) scaling the approach from 4.1K trees in CTB2 to 18.8K trees in CTB version 5.1 (CTB5.1) and (iv) developing a novel treebank-based approach to recovering non-local dependencies (NLDs) for Chinese parser output. Against a new 200-sentence gold standard of manually constructed f-structures, the method achieves 96.00% f-score for f-structures automatically generated for the original CTB trees and 80.01% for NLD-recovered f-structures generated for the trees output by Bikel's parser. 2 Automatic F-Structure Annotation of CTB5.1. 2.1 Chinese LFG. Research on LFG has provided analyses for a considerable number of linguistic phenomena in Indo-European, Asian, African and Native American and Australian languages. However, to date, there has been no standard LFG account for many of the core phenomena of Chinese, a language drastically different from English, German, French and other Indo-European languages, which are often the focus of
A Linguistically Inspired Statistical Model for Chinese Punctuation Generation
ACM Transactions on Asian Language Information Processing, 2010
This article investigates a relatively underdeveloped subject in natural language processing: the generation of punctuation marks. From a theoretical perspective, we study 16 Chinese punctuation marks as defined in the Chinese national standard of punctuation usage, and categorize these punctuation marks into three different types according to their syntactic properties. We implement a three-tier maximum entropy model incorporating linguistically-motivated features for generating the commonly used Chinese punctuation marks in unpunctuated sentences output by a surface realizer. Furthermore, we present a method to automatically extract cue words indicating sentence-final punctuation marks as a specialized feature to construct a more precise model. Evaluating on the Penn Chinese Treebank data, the MaxEnt model achieves an f-score of 79.83% for punctuation insertion and 74.61% for punctuation restoration using gold data input, 79.50% for insertion and 73.32% for restoration using parse...
We present dependency-based n-gram models for general-purpose, wide-coverage, probabilistic sentence realisation. Our method linearises unordered dependencies in input representations directly rather than via the application of grammar rules, as in traditional chart-based generators. The method is simple, efficient, and achieves competitive accuracy and complete coverage on standard English (Penn-II, 0.7440 BLEU, 0.05 sec/sent) and Chinese (CTB6, 0.7123 BLEU, 0.14 sec/sent) test data.
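As a rough illustration of linearising unordered dependencies with an n-gram model, the Python sketch below scores every ordering of a head and its dependents with a bigram model over relation labels and keeps the most probable one. The toy training sequences, the `HEAD` placeholder, the add-one smoothing and all function names are assumptions chosen for the example; the abstract does not specify the model order, the smoothing, or the search strategy.

```python
import itertools
import math
from collections import Counter

# Assumed toy training data: each sequence is the observed order of a head
# ("HEAD") and the relation labels of its dependents.
TRAIN = [
    ["subj", "HEAD", "obj"],
    ["subj", "adv", "HEAD", "obj"],
    ["subj", "HEAD"],
]


def train_bigrams(sequences):
    """Count context unigrams and bigrams over boundary-padded sequences."""
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for seq in sequences:
        padded = ["<s>"] + list(seq) + ["</s>"]
        vocab.update(padded)
        contexts.update(padded[:-1])
        bigrams.update(zip(padded, padded[1:]))
    return contexts, bigrams, len(vocab)


def log_prob(seq, model):
    """Add-one smoothed bigram log-probability of one candidate order."""
    contexts, bigrams, vocab_size = model
    padded = ["<s>"] + list(seq) + ["</s>"]
    return sum(
        math.log((bigrams[(a, b)] + 1) / (contexts[a] + vocab_size))
        for a, b in zip(padded, padded[1:])
    )


def linearize(items, model):
    """Return the permutation of items with the highest bigram score."""
    best = max(itertools.permutations(items), key=lambda p: log_prob(p, model))
    return list(best)


MODEL = train_bigrams(TRAIN)
print(linearize(["obj", "HEAD", "subj"], MODEL))
# prints ['subj', 'HEAD', 'obj']
```

No grammar rules are applied at any point: the ordering falls out of the n-gram statistics alone, which is the contrast with chart-based generation that the abstract draws.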