Merrill Hutchison - Academia.edu
Papers by Merrill Hutchison
First, we survey the nonstandard or exaggerated linguistic characteristics that English-language genealogical text (and indeed that of other languages) often exhibits. For example, English genealogical prose avoids frequent repetition of subject pronouns: they are simply dropped, though this would usually be considered ungrammatical except in diaries. Also, genealogical text frequently mentions names, dates, and places in ways that cause problems for traditional natural language processing (NLP) systems. We briefly illustrate how variation from grammatical norms is also common in genealogical text in other languages, though for this talk we focus on English. We discuss how this type of prose is typically preprocessed and tokenized, and then mention how our approach is implemented as the first stage in our integrated system. The result of our integrated approach, that of preprocessing raw genealogical text, is to render it more amenable to subsequent linguistically based treatment.
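As an illustration of the kind of preprocessing and tokenization discussed above, here is a minimal sketch in Python. It is not the authors' implementation: the abbreviation table, the date pattern, and the function names (`expand_abbreviations`, `join_dates`, `preprocess`) are all assumptions made for this example. It expands a few genealogical abbreviations and joins each date into a single token so that a downstream parser could treat it as one unit.

```python
import re

# Hypothetical abbreviation table for illustration; not taken from the paper.
ABBREVIATIONS = {
    "b.": "born",
    "d.": "died",
    "m.": "married",
    "bur.": "buried",
}

# Matches dates such as "12 Mar 1850": day, month abbreviation, four-digit year.
DATE_PATTERN = re.compile(
    r"\b(\d{1,2}) (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\.? (\d{4})\b"
)


def expand_abbreviations(text):
    # Replace whitespace-delimited tokens that exactly match a known
    # abbreviation. Matching is case-sensitive so that a middle initial
    # such as "D." is left alone.
    return " ".join(ABBREVIATIONS.get(tok, tok) for tok in text.split())


def join_dates(text):
    # Rewrite "12 Mar 1850" as the single token "12_Mar_1850".
    return DATE_PATTERN.sub(r"\1_\2_\3", text)


def preprocess(text):
    # Expand abbreviations first, then protect dates.
    return join_dates(expand_abbreviations(text))


if __name__ == "__main__":
    print(preprocess("b. 12 Mar 1850 in Ohio; m. Mary Smith 3 Jun 1872."))
```

A real system would need a much larger abbreviation inventory and would also have to deal with the dropped subject pronouns mentioned above, which this sketch does not attempt.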
Selected Proceedings of the …, 2006
Overview
This paper introduces a new system that has been developed specifically for processing abbreviated text from an information-extraction point of view. Each of the three principal components of the system (text preprocessing, parsing for content, and discourse processing) is discussed in turn. Approaches used in the development of the knowledge sources are mentioned, with a particular focus on the linguistic issues involved. Examples of how sentences are processed are given, and strengths of the system are stressed. Finally, conclusions are drawn and future work and applications are sketched.

Introducing LG-Soar
The new system, called LG-Soar, represents the integration of three major processing components: (i) regular-expression-based text preprocessing; (ii) the Link Grammar parser; and (iii) the Soar intelligent agent architecture. The result is a robust, versatile text processing engine useful for difficult-to-handle input.

Why a new Soar parser?
NL-Soar is designed for cognitive modeling of natural language use and is not (yet) versatile enough to handle grammatically problematic text.

The project presented several interesting challenges from an NLP perspective. The overall goal was to mine content from problematic text; most existing systems perform well only on well-structured, completely grammatical text. Another challenge was to address complicated linguistic issues in the development of a usable system. Another goal was to output the information in a variety of usable formats. Finally, the project was meant to test the feasibility of integrating this particular set of components within a unified agent architecture.

The system
The LG-Soar component of the system operates as follows:
1) An entry is read in from a pre-processed input file.
2) Each entry is split into individual sentences.
3) Each sentence is parsed with the Link Grammar.
4) The discourse representation module creates semantic/discourse representations of link content for all sentences in the entry.
5) Output is generated according to various formats.
These steps are discussed in further detail in the rest of the paper.
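The five steps listed above can be pictured as a simple pipeline. The Python sketch below is only a schematic under stated assumptions: it does not call the Link Grammar parser or Soar, and every name in it (`split_sentences`, `placeholder_parse`, `build_discourse`, `process_entry`, `to_json`) is hypothetical. The parser is passed in as a plain callable so that a real parser could be substituted for the placeholder, and JSON stands in for one of the output formats.

```python
import json
import re


def split_sentences(entry):
    # Step 2: naive splitter that breaks after ., ! or ? followed by
    # whitespace. Abbreviations ending in a period would fool it.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", entry) if s.strip()]


def placeholder_parse(sentence):
    # Step 3 stand-in: a real system would call the Link Grammar parser
    # here. This placeholder only returns the whitespace-separated words.
    return {"sentence": sentence, "tokens": sentence.rstrip(".!?").split()}


def build_discourse(parses):
    # Step 4 stand-in: collect the per-sentence results for one entry.
    return {"sentence_count": len(parses), "parses": parses}


def process_entry(entry, parse=placeholder_parse):
    # Steps 2 to 4 for a single entry that has already been read (step 1).
    sentences = split_sentences(entry)
    parses = [parse(sentence) for sentence in sentences]
    return build_discourse(parses)


def to_json(representation):
    # Step 5: one possible output format.
    return json.dumps(representation, indent=2)


if __name__ == "__main__":
    print(to_json(process_entry("Born in Ohio. Married Mary Smith in 1872.")))
```

Keeping the parser behind a callable mirrors the modular description in the text, where preprocessing, parsing, and discourse processing are separate components.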