Text Indexing Research Papers - Academia.edu
The SPIRIT search engine provides a test bed for the development of web search technology that is specialised for access to geographical information. Major components include the user interface, a geographical ontology, maintenance and retrieval functions for a test collection of web documents, textual and spatial indexes, relevance ranking and metadata extraction. Here we summarise the functionality and interaction between these components before focusing on the design of the geo-ontology and the development of spatio-textual indexing methods. The geo-ontology supports functionality for disambiguation, query expansion, relevance ranking and metadata extraction. Geographical place names are accompanied by multiple geometric footprints and qualitative spatial relationships. Spatial indexing of documents has been integrated with text indexing through the use of spatio-textual keys in which terms are concatenated with spatial cells to which they relate. Preliminary experiments demonstrate considerable performance benefits when compared with pure text indexing and with text indexing followed by a spatial filtering stage.
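The idea of a spatio-textual key can be illustrated with a minimal sketch. It assumes a uniform grid of cells and rectangular document footprints; the class name `SpatioTextualIndex`, the `CELL_SIZE` value and the sample documents are hypothetical and are not taken from the SPIRIT implementation.

```python
from collections import defaultdict

CELL_SIZE = 1.0  # hypothetical grid resolution, in degrees


def cell_of(lon, lat):
    """Map a point to the id of the grid cell that contains it."""
    return (int(lon // CELL_SIZE), int(lat // CELL_SIZE))


def cells_in_box(min_lon, min_lat, max_lon, max_lat):
    """List every grid cell overlapped by a bounding box."""
    x0, y0 = cell_of(min_lon, min_lat)
    x1, y1 = cell_of(max_lon, max_lat)
    return [(x, y) for x in range(x0, x1 + 1) for y in range(y0, y1 + 1)]


class SpatioTextualIndex:
    """Inverted index whose keys are (term, cell) pairs."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, terms, footprint):
        # One posting per (term, cell) combination: the term is
        # "concatenated" with each cell its document footprint touches.
        cells = cells_in_box(*footprint)
        for term in terms:
            for cell in cells:
                self.postings[(term.lower(), cell)].add(doc_id)

    def search(self, term, query_box):
        # Only the keys for cells under the query box are read, so
        # postings from elsewhere are never touched.
        hits = set()
        for cell in cells_in_box(*query_box):
            hits |= self.postings.get((term.lower(), cell), set())
        return hits


index = SpatioTextualIndex()
index.add("doc1", ["castle", "hotel"], (-3.5, 51.2, -3.0, 51.6))
index.add("doc2", ["castle"], (-3.3, 55.9, -3.1, 56.0))
result = index.search("castle", (-4.0, 51.0, -3.0, 52.0))
```

Matching happens at cell granularity, so a document that shares a cell with the query box but lies outside it is still returned as a candidate; a real system would refine such candidates against the exact footprint.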
Conventional Information Retrieval Systems (IRSs), also called text indexers, deal with plain text documents or documents with a very elementary structure. Such systems answer queries very efficiently, but they cannot take into account the tags that mark different sections, or at best support them only in a very limited way. By contrast, the documents in a corpus today often have a rich structure: they are marked up in XML (Extensible Markup Language)[1] or in some other format that can be converted to XML more or less simply. Classical IRSs applied to such corpora therefore gain nothing from this structure, and their results do not improve. In addition, several of these corpora are very large, comprising hundreds or thousands of documents that in turn contain millions or hundreds of millions of words. There is therefore a need for efficient and flexible IRSs that work with large structured corpora.
The information environment is seen to be one of the predominant factors for effective maintenance and inspection systems in the operation of commercial aircraft. The design issues can be stated simply as decisions on what information to present, when to present it, and how to present it. In answering these questions, the designer should account for the cognitive abilities of humans and the demands that the task requirements generate. This paper provides a framework for information design by combining concepts from the human factors knowledge base with the specific needs of aircraft inspection. The framework captures the interaction between the inspection task and its information requirements. It is then used, together with the cognitive control categories of Skill-Rule-Knowledge based behaviors, to analyse the information needs of aircraft inspectors. Based on this analysis, guidelines for information systems design are suggested.
A large fraction of an XML document typically consists of text data. The XPath query language allows text search via the equal, contains, and starts-with predicates. Such predicates can be efficiently implemented using a compressed self-index of the document's text nodes. Most queries, however, contain some parts querying the text of the document, plus some parts querying the tree structure. It is therefore a challenge to choose an appropriate evaluation order for a given query, which optimally leverages the execution ...
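As a rough illustration of how the three text predicates can be answered from one index over the text nodes, the sketch below swaps the compressed self-index for a plain, uncompressed list of sorted suffixes. The class `TextNodeIndex` and its method names are hypothetical, the structure uses quadratic memory and is meant only for small inputs, and the text is assumed not to contain the character U+10FFFF, which serves as an upper sentinel.

```python
from bisect import bisect_left, bisect_right


class TextNodeIndex:
    """Sorted-suffix index over the text nodes of one document."""

    def __init__(self, texts):
        self.texts = list(texts)
        # Every suffix of every text node, tagged with its node id and
        # start offset, sorted lexicographically by the suffix itself.
        entries = sorted(
            (text[i:], node, i)
            for node, text in enumerate(self.texts)
            for i in range(len(text))
        )
        self.suffixes = [e[0] for e in entries]
        self.where = [(e[1], e[2]) for e in entries]

    def _occurrences(self, pattern):
        # All suffixes that begin with the pattern form one contiguous
        # run in sorted order; two binary searches find its bounds.
        lo = bisect_left(self.suffixes, pattern)
        hi = bisect_right(self.suffixes, pattern + "\U0010ffff")
        return self.where[lo:hi]

    def contains(self, pattern):
        return {node for node, _ in self._occurrences(pattern)}

    def starts_with(self, pattern):
        return {node for node, off in self._occurrences(pattern) if off == 0}

    def equals(self, pattern):
        return {
            node
            for node in self.starts_with(pattern)
            if len(self.texts[node]) == len(pattern)
        }


idx = TextNodeIndex(["text indexing", "index", "XML index structures"])
nodes_containing = idx.contains("index")
nodes_starting = idx.starts_with("index")
nodes_equal = idx.equals("index")
```

All three predicates share the same range lookup and differ only in how they filter the matched offsets, which is why a single text index can serve them; the question of when to run the text part before or after the tree part of a query is not addressed by this sketch.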
We consider a set of natural language processing techniques based on finite-state technology that can be used to analyze huge amounts of text. These techniques include an advanced tokenizer, a part-of-speech tagger that can manage ambiguous streams of words, a system for conflating words by means of derivational mechanisms, and a shallow parser to extract syntactic-dependency pairs. We propose to use these techniques in order to improve the performance of standard indexing engines.
In this article we present the MIRTO platform, under development at the University Stendhal of Grenoble, and how it addresses common flaws of CALL software. This platform led to another project: the creation of a pedagogically indexed text base. We introduce here the notion of pedagogical indexation and compare the particular case of pedagogical indexation for language learning against the existing pedagogical resource description standards, before proposing leads towards implementing it. (http://www.formatex.org/micte2005/165.pdf)
Due to the popularity of the XML data format, several query languages for XML have been proposed, specially devised to handle data of which the structure is unknown, loose, or absent. While these languages are rich enough to allow for querying the content and ...
In this paper we describe the geographic information retrieval system developed by the Multimedia & Information Systems team for GeoCLEF 2006 and the results achieved. We detail our methods for generating and applying co-occurrence models for the purpose of place name disambiguation, our use of named entity recognition tools and text indexing applications. The presented system is split into two stages: a batch text & geographic indexer and a real time query engine. The query engine takes manually crafted queries where the text component is separated from the geographic component. Two monolingual runs were submitted for the GeoCLEF evaluation, the first constructed from the title and description, the second included the narrative also. We explain in detail our use of co-occurrence models for place name disambiguation using a model generated from Wikipedia. The paper concludes with a full description of future work and ways in which the system could be optimised.
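A toy sketch of co-occurrence-based place name disambiguation follows. It is not the model described in the paper: the class `CooccurrenceModel`, the scoring rule (a plain sum of co-occurrence counts) and the training pairs are all assumptions made for illustration, with a few hand-written pairs standing in for counts that would be gathered from a resource such as Wikipedia.

```python
from collections import Counter, defaultdict


class CooccurrenceModel:
    """Counts how often context terms appear alongside each location."""

    def __init__(self):
        self.counts = defaultdict(Counter)

    def train(self, location, context_terms):
        # Record one observation of the location with its context terms.
        self.counts[location].update(t.lower() for t in context_terms)

    def disambiguate(self, candidates, context_terms):
        # Pick the candidate whose recorded context overlaps most with
        # the terms around the ambiguous name. On a tie, the candidate
        # listed first is returned.
        context = [t.lower() for t in context_terms]

        def score(location):
            seen = self.counts.get(location, Counter())
            return sum(seen[t] for t in context)

        return max(candidates, key=score)


model = CooccurrenceModel()
model.train("Cambridge, UK", ["university", "england", "cam"])
model.train("Cambridge, MA", ["harvard", "mit", "massachusetts", "boston"])
choice = model.disambiguate(
    ["Cambridge, UK", "Cambridge, MA"], ["flights", "to", "Boston"]
)
```

Raw counts favour locations with more training text, so a practical model would normalise the scores and fall back to a default sense when no context term has been seen.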
Terminology management is a key component of many natural language processing activities such as machine translation (Langlais and Carl, 2004), text summarization and text indexation. With the rapid development of science and technology continuously increasing the number of technical terms, terminology management is certain to become of the utmost importance in more and more content-based applications.