Advaith Siddharthan | University of Aberdeen

Papers by Advaith Siddharthan

Investigation into Human Preference between Common and Unambiguous Lexical Substitutions

abdn.ac.uk

We present a study that investigates the factors that determine what makes a good lexical substitution. We begin by observing that there is a correlation between the corpus frequency of words and the number of WordNet senses they have, and hypothesise that readers might prefer common, but more ambiguous words over less ambiguous but also less common ones. We identify four properties of a word that determine whether it is a suitable substitution in a given context, and ask volunteers to rank their preferences between two common but ambiguous lexical substitutions, and two uncommon but also unambiguous ones. Preliminary results suggest a slight preference towards the unambiguous.

Semantic Annotation of Multilingual Text Corpora

cfar.umd.edu

This paper describes a multi-site project to annotate six sizable bilingual parallel corpora for interlingual content. After presenting the background and objectives of the effort, we describe the data set that is being annotated, the interlingua representation language used, an interface environment that supports the annotation task and the annotation process itself. We then present our evaluation methodology and conclude with a summary of the current status of the project along with a number of issues which have arisen.

IAMTC Report: Background for IAMTC Interlingua and Representational Comparison for Task G

Tell me a story about the birds and the bees: Using NLG to foster public engagement in nature conservation projects

For many important nature conservation programmes, western societies are increasingly reliant on the activities of volunteers, who, collectively, have come to represent an unpaid work force of considerable size and importance. Although a variety of effective ways exist to generate adequate recruitment, volunteer retention is harder to achieve, particularly when schemes grow bigger or tasks get more difficult. We describe two case studies that we are using to investigate the hypothesis that richness of information provision, of the kind that can be provided by Natural Language Generation (NLG), can play a role in fostering volunteer interest and motivation. Both these case studies involve collaboration with large existing conservation projects, which provide the possibility for evaluation on a realistic scale.

Evaluating an Open Domain GRE algorithm on closed domains System IDs: CAM-B, CAM-T, CAM-BU and CAM-TU

csd.abdn.ac.uk

We present four variations of our 2004 incremental algorithm, and present results on both the Furniture and People datasets.

Inderjeet Mani and Mark T. Maybury (eds). Advances in Automatic Text Summarization. MIT Press, 1999. ISBN 0-262-13359-8. 442 pp. $47.95/£32.95 (paperback).

Natural Language Engineering, Jan 1, 2001

Generating research websites using summarisation techniques

Proceedings of the 46th Annual Meeting …, Jan 1, 2008

Ehud Reiter and Robert Dale. Building Natural Language Generation Systems. Cambridge University Press, 2000. $64.95/£37.50 (Hardback). 234 pages

Natural Language Engineering, Jan 1, 2001

Advaith Siddharthan, University of Cambridge (e-mail: as372@cl.cam.ac.uk). Natural Language Engineering, Volume 7, Issue 03, September 2001, pp. 271-274. http://journals.cambridge.org/abstract_S1351324901212704

Intelligent Information Access from Scientific Papers

Current Challenges in …, Jan 1, 2011

We describe a novel search engine for scientific literature. The system allows for sentence-level search starting from portable document format (PDF) files, and integrates text and image search, thus facilitating the retrieval of information present in tables and figures. It allows the user to generate in an intuitive manner complex queries for search terms that are related through particular grammatical (and thus implicitly semantic) relations. The system uses grid processing to parallelise the analysis of large numbers of scientific papers. It is currently undergoing user evaluation, but we report some preliminary evaluation and comparison with Google Scholar, demonstrating its utility. Finally, we discuss future work and the potential and complementarity of the system for patent search.

Text Simplification using Typed Dependencies: A Comparison of the Robustness of Different Generation Strategies

abdn.ac.uk

We present a framework for text simplification based on applying transformation rules to a typed dependency representation produced by the Stanford parser. We test two approaches to regeneration from typed dependencies: (a) gen-light, where the transformed dependency graphs are linearised using the word order and morphology of the original sentence, with any changes coded into the transformation rules, and (b) gen-heavy, where the Stanford dependencies are reduced to a DSyntS representation and sentences are generated formally using the RealPro surface realiser. The main contribution of this paper is to compare the robustness of these approaches in the presence of parsing errors, using both a single parse and an n-best parse setting in an overgenerate-and-rank approach. We find that the gen-light approach is robust to parser error, particularly in the n-best parse setting. On the other hand, parsing errors cause the realiser in the gen-heavy approach to order words and phrases in ways that are disliked by our evaluators.

Resolving Pronouns Robustly

Christopher D. Manning and Hinrich Schutze. Foundations of Statistical Natural Language Processing. MIT Press, 2000. ISBN 0-262-13360-1. 620 pp. $64.95/£44.95 …

Natural Language Engineering, Jan 1, 2002

Christopher D. Manning and Hinrich Schutze. Foundations of Statistical Natural Language Processing. MIT Press, 2000. ISBN 0-262-13360-1. 620 pp. $64.95/£44.95 (cloth).

Camtology: intelligent information access for science

Proceedings of the …, Jan 1, 2010

We describe a novel semantic search engine for scientific literature. The Camtology system allows for sentence-level searches of PDF files and combines text and image searches, thus facilitating the retrieval of information present in tables and figures. It allows the user to generate complex queries for search terms that are related through particular grammatical/semantic relations in an intuitive manner. The system uses Grid processing to parallelise the analysis of large numbers of papers.

Language Resources and Chemical Informatics

Proceedings of the …, Jan 1, 2008

Chemistry research papers are a primary source of information about chemistry, as in any scientific field. The presentation of the data is, predominantly, unstructured information, and so not immediately susceptible to processes developed within chemical informatics for carrying out chemistry research by information processing techniques. At one level, extracting the relevant information from research papers is a text mining task, requiring both extensive language resources and specialised knowledge of the subject domain. However, the papers also encode information about the way the research is conducted and the structure of the field itself. Applying language technology to research papers in chemistry can facilitate eScience on several different levels.

Interlingua Development and Testing through Semantic Annotation of Multilingual Text Corpora

This paper describes a multi-site project to annotate the interlingual content of six sizable bilingual parallel corpora. The project addresses several principal problems in parallel: specification of interlingua content and notation, development of reliable annotation methods, and evaluation of annotated corpora. As a by-product, a growing corpus of annotated texts is being produced, which may eventually be useful for machine learning of semantics-based processing.

Semantic Annotation for Interlingual Representation of Multilingual Texts

Workshop …, Jan 1, 2004

This paper describes the annotation process being used in a multi-site project to create six sizable bilingual parallel corpora annotated with a consistent interlingua representation. After presenting the background and objectives of the effort, we describe the multilingual corpora and the three stages of interlingual representation being developed. We then focus on the annotation process itself, including an interface environment that supports the annotation task, and the methodology for evaluating the interlingua representation. Finally, we discuss some issues encountered during the annotation tasks. The resulting annotated multilingual corpora will be useful for a wide range of natural language processing research tasks, including machine translation, question answering, text summarization, and information extraction.

Information status distinctions and referring expressions: An empirical study of references to people in news summaries

Computational Linguistics, Jan 1, 2011

While there has been much theoretical work on using various information status distinctions to explain the form of references in written text, there have been few studies that attempt to automatically learn these distinctions for generating references in the context of computer-regenerated text. In this article, we present a model for generating references to people in news summaries that incorporates insights from both theory and a corpus analysis of human-written summaries. In particular, our model captures how two properties of a person referred to in the summary (familiarity to the reader and global salience in the news story) affect the content and form of the initial reference to that person in a summary. We demonstrate that these two distinctions can be learned from a typical input for multi-document summarization and that they can be used to make regeneration decisions that improve the quality of extractive summaries.

10. Interlingual annotation of multilingual text corpora and FrameNet

Multilingual …, Jan 1, 2009

This article raises an issue of common interest to those interested in Interlinguas and interlingual MT and those interested in developing a multilingual FrameNet. Specifically, it addresses the problem of teasing apart the difference between meaning and interpretation, between semantics and pragmatics and between semantic representation and the representation of conveyed information. No translation (nor paraphrase) conveys exactly the same information as the original utterance. Rather, additional information may be conveyed or information may be lost, and information originally expressed explicitly may be conveyed implicitly and vice versa. The semantic representation of an utterance (the result of integrating the semantic representations of its subcomponents) does not capture what people intuitively feel is the meaning of an utterance; rather, various pragmatic factors must be taken into account as well, including the time and place of utterance and the speaker's motivation for uttering. The focus of the discussion is on describing IAMTC, a multi-site NSF-supported project to annotate six sizable bilingual parallel corpora for interlingual content. After setting out the basic issues, we present the background and objectives of the IAMTC annotation effort, the dataset being annotated, the interlingual representation language used, the annotator's interface and annotation process itself, along with the evaluation methodology and results of an initial evaluation. Finally, we conclude by summarizing the current state of the project and presenting a number of issues yet to be resolved.

Complex lexico-syntactic reformulation of sentences using typed dependency representations

Proceedings of the 6th International Natural Language …, Jan 1, 2010

We present a framework for reformulating sentences by applying transfer rules on a typed dependency representation. We specify a list of operations that the framework needs to support and argue that typed dependency structures are currently the most suitable formalism for complex lexico-syntactic paraphrasing. We demonstrate our approach by reformulating sentences expressing the discourse relation of causation using four lexico-syntactic discourse markers: "cause" as a verb and as a noun, "because" as a conjunction and "because of" as a preposition.

Parallel syntactic annotation of multiple languages

Proceedings of the …, Jan 1, 2006
