Exploiting background information in knowledge discovery from text

Mining associations in text in the presence of background knowledge

1996

This paper describes the FACT system for knowledge discovery from text. It discovers associations (patterns of co-occurrence) among keywords labeling the items in a collection of textual documents. In addition, FACT is able to use background knowledge about the keywords labeling the documents in its discovery process. FACT takes a query-centered view of knowledge discovery, in which a discovery request is viewed as a query over the implicit set of possible results supported by a collection of documents, and where background knowledge is used to specify constraints on the desired results of this query process. Execution of a knowledge-discovery query is structured so that these background-knowledge constraints can be exploited in the search for possible results. Finally, rather than requiring a user to specify an explicit query expression in the knowledge-discovery query language, FACT presents the user with a simple-to-use graphical interface to the query language, with the language providing a well-defined semantics for the discovery actions performed by a user through the interface.
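The idea of mining keyword co-occurrences under a background-knowledge constraint can be illustrated with a minimal sketch. This is not FACT's actual algorithm or query language: the function name `keyword_associations`, the thresholds, the `allowed_left` parameter, and the toy documents are all assumptions made for illustration. The sketch counts keyword pairs, derives support and confidence, and only builds rules whose left-hand side belongs to a background category (here, "country").

```python
from collections import Counter
from itertools import combinations


def keyword_associations(docs, min_support=2, min_confidence=0.5, allowed_left=None):
    """Return (left, right, support, confidence) rules over keyword-labeled documents.

    docs: iterable of keyword sets, one per document.
    allowed_left: optional set of keywords permitted on the left-hand side
                  (a stand-in for a background-knowledge constraint).
    """
    single = Counter()
    pair = Counter()
    for keywords in docs:
        keywords = set(keywords)
        single.update(keywords)
        pair.update(combinations(sorted(keywords), 2))

    rules = []
    for (a, b), support in pair.items():
        if support < min_support:
            continue
        for left, right in ((a, b), (b, a)):
            # Apply the constraint while generating rules, not afterwards.
            if allowed_left is not None and left not in allowed_left:
                continue
            confidence = support / single[left]
            if confidence >= min_confidence:
                rules.append((left, right, support, confidence))
    return sorted(rules, key=lambda r: (-r[3], -r[2], r[0], r[1]))


# Hypothetical keyword-labeled documents and background knowledge.
docs = [{"iran", "oil"}, {"iran", "oil", "usa"}, {"usa", "wheat"}, {"iran", "usa"}]
background = {"country": {"iran", "usa"}}

rules = keyword_associations(docs, allowed_left=background["country"])
unconstrained = keyword_associations(docs)
for left, right, support, confidence in rules:
    print(f"{left} -> {right}  support={support}  confidence={confidence:.2f}")
```

Without the constraint, the rule `oil -> iran` would also be reported; with it, only rules whose left-hand side is a country are generated.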

Knowledge Discovery in Textual Databases (KDT)

1995

The information age is characterized by a rapid growth in the amount of information available in electronic media. Traditional data handling methods are not adequate to cope with this information flood. Knowledge Discovery in Databases (KDD) is a new paradigm that focuses on computerized exploration of large amounts of data and on discovery of relevant and interesting patterns within them. While most work on KDD is concerned with structured databases, it is clear that this paradigm is required for handling the huge amount of information that is available only in unstructured textual form. To apply traditional KDD to texts, it is necessary to impose some structure on the data that is rich enough to allow for interesting KDD operations. On the other hand, we have to consider the severe limitations of current text processing technology and define rather simple structures that can be extracted from texts fairly automatically and at a reasonable cost. We propose using a text categorization paradigm to annotate text articles with meaningful concepts that are organized in a hierarchical structure. We suggest that this relatively simple annotation is rich enough to provide the basis for a KDD framework, enabling data summarization, exploration of interesting patterns, and trend analysis. This research combines the KDD and text categorization paradigms and suggests advances to the state of the art in both areas.
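One way to picture how hierarchical concept annotations support data summarization is a simple roll-up of document counts along the hierarchy. The sketch below is an assumption-laden illustration, not the KDT system's method: the `parent` mapping, the concept names, and the function `rollup_counts` are hypothetical, and the hierarchy is assumed to be acyclic.

```python
from collections import Counter


def rollup_counts(doc_concepts, parent):
    """Count, for every concept, the documents annotated with it or a descendant.

    doc_concepts: iterable of concept sets, one per document.
    parent: dict mapping a concept to its parent concept (roots are absent).
            Assumed to be acyclic.
    """
    counts = Counter()
    for concepts in doc_concepts:
        covered = set()
        for concept in concepts:
            # Walk up the hierarchy, marking each ancestor once per document.
            while concept is not None:
                covered.add(concept)
                concept = parent.get(concept)
        counts.update(covered)
    return counts


# Hypothetical hierarchy and annotations.
parent = {"france": "europe", "germany": "europe", "europe": "countries"}
doc_concepts = [{"france"}, {"germany"}, {"france", "germany"}]

counts = rollup_counts(doc_concepts, parent)
print(counts["france"], counts["europe"], counts["countries"])
```

A document annotated with both `france` and `germany` is counted once under `europe`, so the higher-level counts summarize documents rather than annotations.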

A System for Knowledge Discovery in Big Dynamical Text Collections

2012

The Cordiet-FCA software system is presented, which is designed for knowledge discovery in big dynamic data collections, including texts in natural language. Cordiet-FCA allows one to compose ontology-controlled queries and outputs concept lattices, implication bases, association rules, and other useful concept-based artifacts. Efficient algorithms for data preprocessing, text processing, and visualization of results are discussed. Examples of applying the system to problems of medical diagnostics and criminal investigations are considered.

Relationship Discovery in Large Text Collections Using Latent Semantic Indexing

Proceedings of the Fourth Workshop on Link Analysis, 2006

This paper addresses the problem of information discovery in large collections of text. For users, one of the key problems in working with such collections is determining where to focus their attention. In selecting documents for examination, users must be able to formulate reasonably precise queries. Queries that are too broad will greatly reduce the efficiency of information discovery efforts by overwhelming the users with peripheral information. In order to formulate efficient queries, a mechanism is needed to automatically alert users regarding potentially interesting information contained within the collection. This paper presents the results of an experiment designed to test one approach to generation of such alerts. The technique of latent semantic indexing (LSI) is used to identify relationships among entities of interest. Entity extraction software is used to pre-process the text of the collection so that the LSI space contains representation vectors for named entities in addition to those for individual terms. In the LSI space, the cosine of the angle between the representation vectors for two entities captures important information regarding the degree of association of those two entities. For appropriate choices of entities, determining the entity pairs with the highest mutual cosine values yields valuable information regarding the contents of the text collection. The test database used for the experiment consists of 150,000 news articles. The proposed approach for alert generation is tested using a counterterrorism analysis example. The approach is shown to have significant potential for aiding users in rapidly focusing on information of potential importance in large text collections. The approach also has value in identifying possible use of aliases.
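The pair-ranking step described above can be sketched as follows. The sketch assumes the entity vectors have already been obtained from an LSI space (the dimensionality reduction itself is not shown), and the entity names, the toy two-dimensional vectors, and the function names are all hypothetical.

```python
from itertools import combinations
from math import sqrt


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = sqrt(sum(x * x for x in u))
    norm_v = sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_entity_pairs(vectors, k=3):
    """Return the k entity pairs with the highest mutual cosine, best first."""
    scored = [
        (cosine(vectors[a], vectors[b]), a, b)
        for a, b in combinations(sorted(vectors), 2)
    ]
    scored.sort(key=lambda item: -item[0])
    return scored[:k]


# Hypothetical entity vectors, assumed to come from a precomputed LSI space.
vectors = {
    "person_a": [1.0, 0.0],
    "alias_a": [2.0, 0.0],
    "org_x": [0.0, 1.0],
}

top = top_entity_pairs(vectors, k=1)
print(top)
```

In this toy data, `person_a` and `alias_a` point in the same direction, so they surface as the most strongly associated pair, which loosely mirrors the alias-detection use mentioned in the abstract.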

A System For Information Extraction And Intelligent Search Using Dynamically Acquired Background Knowledge

2003

This paper presents a simple framework for extracting information found in publications or documents that are issued in large volumes and which cover similar concepts or issues within a given domain. The general aim of the work described is to present a model for automatically augmenting segments of these documents with metadata using dynamically acquired background domain knowledge, in order to help users easily locate information within these documents through a structured front end. To realize this ...

Concept-based knowledge discovery in texts extracted from the web

Sigkdd Explorations, 2000

This paper presents an approach for knowledge discovery in texts extracted from the Web. Instead of analyzing words or attribute values, the approach is based on concepts, which are extracted from texts to be used as characteristics in the mining process. Statistical techniques are applied to concepts in order to find interesting patterns in concept distributions or associations. In this way, users can perform discovery at a high level, since concepts describe real world events, objects, thoughts, etc. To identify concepts in texts, a categorization algorithm is used together with a prior classification task that defines the concepts. Two experiments are presented: one for political analysis and the other for competitive intelligence. At the end, the approach is discussed, examining its problems and advantages in the Web context.

Hyperdictionary: a Knowledge Discovery Tool to Help Information Retrieval

In this paper, we present a knowledge discovery tool to help the information retrieval process. This tool, which we call Hyperdictionary, is capable of acquiring and holding relationships among words to describe the context of some information, that is, the knowledge about some subject. This knowledge is then used to retrieve information by contextual search in a document collection. Our preliminary studies indicate that this structure can be used in conjunction with an information retrieval tool to aid the user in formulating a query. The user's query is automatically expanded with words strongly related to the information the user wants, facilitating the retrieval process.
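Query expansion from stored word relationships can be sketched roughly as below. The data structure (a nested dictionary of relation strengths), the threshold, and the function `expand_query` are assumptions made for illustration and are not taken from the Hyperdictionary paper.

```python
def expand_query(query_terms, related, threshold=0.5, max_new=3):
    """Append the words most strongly related to the query terms.

    related: dict mapping a word to {related_word: strength in [0, 1]}.
    Only words at or above `threshold` are considered, and at most
    `max_new` words are added, strongest first.
    """
    scores = {}
    for term in query_terms:
        for word, strength in related.get(term, {}).items():
            if word not in query_terms and strength >= threshold:
                # Keep the strongest relation seen for each candidate word.
                scores[word] = max(scores.get(word, 0.0), strength)
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return list(query_terms) + ranked[:max_new]


# Hypothetical word relationships with made-up strengths.
related = {
    "jaguar": {"car": 0.8, "cat": 0.6, "river": 0.1},
    "engine": {"car": 0.9, "motor": 0.7},
}

expanded = expand_query(["jaguar", "engine"], related)
print(expanded)
```

Weakly related words such as `river` fall below the threshold and are left out, so the expansion stays close to the context implied by the original terms.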

Knowledge Discovery in Text Mining using Association Rule Extraction

International Journal of Computer Applications, 2016

The Internet and information technology make a huge amount of information available. But searching for the exact information needed is time consuming and can be confusing. Retrieving knowledge manually from a collection of web documents and databases may cause the user to lose track. Text mining helps the user find accurate information, discover knowledge, and identify features in text documents. Thus there is a need to develop a text mining approach that clearly guides the user about what information is important and what is not, how to deal with important information, how to generate knowledge, and so on. Knowledge discovery is a growing field of research. For a user, reading a collection of documents to gain some knowledge is time consuming and less effective. There has been significant progress in research on knowledge discovery from collections of documents. We propose a method for knowledge discovery in text mining using association rule extraction. Using this approach, users are able to find accurate and important knowledge from a collection of web documents, which reduces the time spent reading all those documents.

101. Clasitex+, a tool for knowledge discovery from texts

Proceedings of the III Iberoamerican workshop on Pattern Recognition, TIARP-98, 1998

Recent years have seen rapid growth in the knowledge available through electronic media. Traditional data handling methods are becoming less and less capable of meeting the demands of this information deluge. Therefore, several strategies have been proposed for fast recovery, search, and, in general, intelligent “analysis” of the information. All these strategies fall under what is called data mining. Most of the existing work has been done on structured (i.e., numeric) databases. Nevertheless, a large portion of the available information is in collections of texts written in Spanish, English, or other natural languages (histories, newspaper articles, email messages, web pages, etc.). The problem of finding interesting things in a collection of documents has been termed "knowledge discovery from text" by Ronen Feldman and H. Hirsh [1], and the term “text mining” has been used to refer to research in this arena. It is thus very interesting and worthwhile to develop tools that extract non-trivial information from a non-structured (i.e., textual) database in a reasonable time.