Framework for Document Retrieval using Latent Semantic Indexing

A Comparison of SVD, SVR, ADE and IRR for Latent Semantic Indexing

Communications in Computer and Information Science, 2009

Singular value decomposition (SVD) and its variants, singular value rescaling (SVR), approximation dimension equalization (ADE), and iterative residual rescaling (IRR), have recently been proposed for latent semantic indexing (LSI). Although all of them rest on the same linear algebraic operation on the term-document matrix, namely SVD, their underlying motivations with respect to LSI differ. In this paper, a series of experiments examines their effectiveness for LSI in practical text mining applications, including information retrieval, text categorization, and similarity measurement. The experimental results demonstrate that SVD and SVR outperform the other proposed LSI methods in these applications, whereas ADE and IRR, because their approximation matrices deviate too far from the original term-document matrix in Frobenius norm, fail to achieve good performance in LSI-based text mining.
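
The comparison above hinges on the rank-k SVD approximation of the term-document matrix and its Frobenius-norm distance from the original. The following is a minimal sketch of that computation; the toy count matrix is chosen purely for illustration and none of the values come from the paper:

```python
# Minimal sketch: rank-k SVD approximation of a term-document matrix and its
# Frobenius-norm distance from the original (the quantity the abstract cites
# as the reason ADE/IRR underperform). Toy data, not from the paper.
import numpy as np

def rank_k_approximation(A, k):
    """Return the best rank-k approximation of A (Eckart-Young) via SVD."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Toy 5-term x 4-document count matrix (illustrative values only).
A = np.array([[2., 0., 1., 0.],
              [1., 1., 0., 0.],
              [0., 2., 0., 1.],
              [0., 0., 3., 1.],
              [1., 0., 0., 2.]])

for k in (1, 2, 3):
    A_k = rank_k_approximation(A, k)
    err = np.linalg.norm(A - A_k, "fro")
    print(f"k={k}  ||A - A_k||_F = {err:.3f}")
```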

Latent Semantic Analysis for Information Retrieval

Abstract—This paper presents a statistical method for the analysis and processing of text called Latent Semantic Analysis (LSA). LSA was devised to mimic human understanding of words and language; it simulates the meaning of words and passages computationally by analysing natural language text. It relies on a mathematical model called Singular Value Decomposition, a technique for factorizing a matrix. The paper discusses its application to information retrieval, where it is known as latent semantic indexing, and presents an example that demonstrates the technique.

Indexing by Latent Semantic Analysis

Journal of The American Society for Information Science and Technology, 1990

A new method for automatic indexing and retrieval is described. The approach is to take advantage of implicit higher-order structure in the association of terms with documents ("semantic structure") in order to improve the detection of relevant documents on the basis of terms found in queries. The particular technique used is singular-value decomposition, in which a large term-by-document matrix is decomposed into a set of ca. 100 orthogonal factors from which the original matrix can be approximated by linear combination. Documents are represented by ca. 100-item vectors of factor weights. Queries are represented as pseudo-document vectors formed from weighted combinations of terms, and documents with supra-threshold cosine values are returned. Initial tests find this completely automatic method for retrieval to be promising.
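
The retrieval step described here, folding a query into the factor space as a pseudo-document and returning documents whose cosine similarity exceeds a threshold, can be sketched as follows; the matrix, weighting, and threshold are illustrative assumptions rather than the paper's actual settings:

```python
# Hedged sketch of LSI retrieval: fold a query into the k-dimensional space as
# a pseudo-document and return documents above a cosine-similarity threshold.
import numpy as np

def lsi_retrieve(A, q, k=2, threshold=0.5):
    """A: term-by-document matrix; q: query term vector (same term order)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    Uk, sk, Vk = U[:, :k], s[:k], Vt[:k, :].T      # documents are rows of Vk
    q_hat = (q @ Uk) / sk                          # pseudo-document in LSI space
    sims = Vk @ q_hat / (np.linalg.norm(Vk, axis=1) * np.linalg.norm(q_hat) + 1e-12)
    return [(d, float(sims[d])) for d in np.argsort(-sims) if sims[d] >= threshold]

# Toy 4-term x 4-document matrix and a query mentioning the first two terms.
A = np.array([[1., 0., 0., 1.],
              [1., 1., 0., 0.],
              [0., 1., 1., 0.],
              [0., 0., 1., 1.]])
q = np.array([1., 1., 0., 0.])
print(lsi_retrieve(A, q))
```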

Information Retrieval Using a Singular Value Decomposition Model of Latent Semantic Structure

Proceedings of the …, 1988

In a new method for automatic indexing and retrieval, implicit higher-order structure in the association of terms with documents is modeled to improve estimates of term-document association, and therefore the detection of relevant documents on the basis of terms found in queries. Singular-value decomposition is used to decompose a large term-by-document matrix into 50 to 150 orthogonal factors from which the original matrix can be approximated by linear combination; both documents and terms are represented as vectors in a 50- to 150-dimensional space. Queries are represented as pseudo-document vectors formed from weighted combinations of terms, and documents are ordered by their similarity to the query. Initial tests find this automatic method very promising.

Latent Semantic Indexing based Intelligent Information Retrieval System for Digital Libraries

Journal of Computing and Information Technology, 2006

To the information retrieval research community, a digital library can be viewed as an extended information retrieval system. The primary goal of an information retrieval system is to retrieve all the documents that are relevant to the user's query. Disparities between the vocabulary of the system's authors and that of its users pose difficulties when information is processed without human intervention. In this paper, we present a novel approach to enhancing the efficiency of the information retrieval system using an intelligent information processing technique. The experiments carried out give most encouraging results.

An Indiscriminate Semantic Similarity Search Using Latent Semantic Indexing in Data Mining

2016

In this paper we propose a semantic model of an information system that provides precise definitions of fundamental concepts such as query, subquery, and coupling. Queries are mapped to this space, and documents are retrieved based on a similarity model. The performance of document retrieval is investigated and compared with traditional term-matching techniques, using LSI, the Cholesky transform, LU decomposition, and natural language processing with stemming and stop-word removal to reduce redundancy in searching. Singular value decomposition is then applied to fetch records and information from documents that are similar in meaning and texture. While users want to search information based on conceptual content, natural language limits the expression of these concepts: the individual words contained in a user's query may not explicitly specify the intended concept, which may result in the r...
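
The preprocessing steps mentioned above (stop-word removal and stemming before building the term-document matrix and applying SVD) might look roughly like the following sketch; the stop list and suffix-stripping rules are deliberately naive stand-ins, not the paper's actual tools:

```python
# Illustrative preprocessing sketch: remove stop words, apply a naive
# suffix-stripping stemmer, build the term-document count matrix, then apply
# SVD. Stop list and stemming rules are simplified assumptions.
import re
import numpy as np

STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "by"}

def naive_stem(word):
    """Very rough stemmer: strip a few common English suffixes."""
    for suffix in ("ings", "ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(doc):
    tokens = re.findall(r"[a-z]+", doc.lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]

def term_document_matrix(docs):
    processed = [preprocess(d) for d in docs]
    vocab = sorted({t for doc in processed for t in doc})
    A = np.zeros((len(vocab), len(docs)))
    for j, doc in enumerate(processed):
        for t in doc:
            A[vocab.index(t), j] += 1
    return vocab, A

docs = ["Indexing documents for retrieval",
        "Retrieving indexed documents by semantic similarity"]
vocab, A = term_document_matrix(docs)
U, s, Vt = np.linalg.svd(A, full_matrices=False)   # LSI step applied afterwards
print(vocab)
```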

New Information Retrieval Approach Based on Semantic Indexing by Meaning

Proceedings of the 16th International Conference on Applied Computing 2019, 2019

An Information Retrieval System (IRS) offers a number of tools and techniques that enable users to locate and visualize the relevant information they need. This need is expressed by the user in the form of a natural language query. However, the representation of documents and queries in a traditional IRS leads to a lexically centered relevance estimation, which is less effective than a semantically focused one. As a consequence, documents that are actually relevant are not retrieved if they share no words with the query, while non-relevant documents that do have words in common with the query are retrieved even when those words do not carry the intended meaning. This paper tackles this problem by proposing a solution at the indexing level of an IRS that improves its performance. More precisely, we propose a new semantic indexing approach that determines the exact meaning of each term in a document or query through contextual analysis at the sentence level. If the system is able to understand the user's need, it is then able to respond to it. In addition, we propose a simple method for applying any IR model to our new index table without changing its original basis, making it faster. To validate the proposed approach, the new system is evaluated on several collections, namely "TIME", "BBC", "The Guardian", and "BigThink". The experimental results indicate the efficacy of our hypothesis compared to traditional IR approaches.
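
A heavily simplified sketch of sense-level indexing in this spirit is given below: each term occurrence is mapped to a sense by overlapping its sentence context with sense glosses (a Lesk-style heuristic), and the index is built over sense identifiers instead of raw terms. The sense inventory is a made-up toy, and the paper's actual disambiguation method may differ:

```python
# Toy sketch of indexing by meaning: disambiguate each term against a small
# sense inventory using sentence-level context overlap, then count sense ids.
import re

SENSES = {
    "bank": {"bank#finance": {"money", "account", "loan"},
             "bank#river":   {"river", "water", "shore"}},
}

def disambiguate(term, sentence_tokens):
    """Pick the sense whose gloss overlaps the sentence context the most."""
    if term not in SENSES:
        return term                      # monosemous terms index as themselves
    context = set(sentence_tokens)
    return max(SENSES[term], key=lambda s: len(SENSES[term][s] & context))

def sense_index(doc):
    index = {}
    for sentence in re.split(r"[.!?]", doc):
        tokens = re.findall(r"[a-z]+", sentence.lower())
        for t in tokens:
            sense = disambiguate(t, tokens)
            index[sense] = index.get(sense, 0) + 1
    return index

print(sense_index("She opened an account at the bank. The river bank was muddy."))
```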

Latent Semantic Indexing: An Overview (INFOSYS 240, Spring 2000 Final Paper)

2001

Typically, information is retrieved by literally matching terms in documents with those of a query. However, lexical matching methods can be inaccurate when they are used to match a user's query. Since there are usually many ways to express a given concept (synonymy), the literal terms in a user's query may not match those of a relevant document. In addition, most words have multiple meanings (polysemy), so terms in a user's query will literally match terms in irrelevant documents. A better approach would allow users to retrieve information on the basis of a conceptual topic or meaning of a document. Latent Semantic Indexing (LSI) [Deerwester et al] tries to overcome the problems of lexical matching by using statistically derived conceptual indices instead of individual words for retrieval. LSI assumes that there is some underlying or latent structure in word usage that is partially obscured by variability in word choice. A truncated singular value decomposition (SVD) is...
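
The synonymy effect described above can be illustrated with a small, hand-built term-document matrix: two terms that never co-occur ("car" and "automobile") end up with nearly identical latent representations because they appear with the same companion terms. The data are chosen only to make the effect visible:

```python
# Hedged illustration of synonymy in LSI: "car" and "automobile" never appear
# in the same document, yet their truncated-SVD term vectors are nearly
# identical because both co-occur with "engine". Toy data only.
import numpy as np

terms = ["car", "automobile", "engine", "flower", "petal"]
#                d1  d2  d3  d4  d5  d6
A = np.array([[1., 1., 0., 0., 0., 0.],   # car
              [0., 0., 1., 1., 0., 0.],   # automobile
              [1., 1., 1., 1., 0., 0.],   # engine
              [0., 0., 0., 0., 1., 1.],   # flower
              [0., 0., 0., 0., 1., 1.]])  # petal

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

U, s, Vt = np.linalg.svd(A, full_matrices=False)
T = U[:, :2] * s[:2]              # term vectors in the 2-dimensional LSI space

print("raw cosine(car, automobile):    %.3f" % cosine(A[0], A[1]))
print("latent cosine(car, automobile): %.3f" % cosine(T[0], T[1]))
```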

Ontological Optimization for Latent Semantic Indexing of Arabic Corpus

2018

Dimensionality reduction is a critical problem in the information retrieval process: higher dimensionality directly affects search performance in terms of recall and precision. Dimensionality reduction also enables the search to be semantically rather than lexically based, since the dimensions are defined in terms of semantic concepts instead of traditional terms or keywords. Latent Semantic Indexing (LSI) is a mathematical extension of the classical Vector Space Model (VSM). LSI is used to discover the latent semantics of the search space by extracting concepts from the original terms in the space, and it relies on the Singular Value Decomposition (SVD) to reduce the term space to a lower-dimensional LSI space. In this paper, we propose a methodology for further optimizing LSI dimensionality reduction via two reduction levels. The first reduction level is based on an ontological conceptualization process: the Universal WordNet ontology (UWN) is used to build an ontology-based concept space in place of the term space. At the second reduction level, SVD is applied to the extracted concept space to obtain an optimal LSI conceptualization. The experimental results indicate an improvement in search results in terms of both precision and recall, as the proposed methodology addresses the synonymy and polysemy problems effectively.
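
A hedged sketch of the two-level reduction is shown below. A toy term-to-concept map stands in for the UWN lookup the paper relies on; terms that share a concept are merged into one row (first reduction), and a truncated SVD of the smaller concept-by-document matrix then gives the LSI space (second reduction):

```python
# Sketch of two-level reduction: terms -> concepts (toy map standing in for
# UWN), then truncated SVD of the concept-by-document matrix.
import numpy as np

# Hypothetical concept map; the real system derives this from UWN.
CONCEPT_OF = {"car": "vehicle", "automobile": "vehicle", "truck": "vehicle",
              "rose": "flower", "tulip": "flower"}

def concept_document_matrix(docs):
    concepts = sorted({CONCEPT_OF.get(t, t) for d in docs for t in d})
    C = np.zeros((len(concepts), len(docs)))
    for j, doc in enumerate(docs):
        for t in doc:
            C[concepts.index(CONCEPT_OF.get(t, t)), j] += 1
    return concepts, C

docs = [["car", "engine"], ["automobile", "engine"], ["rose", "tulip"]]
concepts, C = concept_document_matrix(docs)
U, s, Vt = np.linalg.svd(C, full_matrices=False)
k = 2
doc_vectors = (np.diag(s[:k]) @ Vt[:k, :]).T   # documents in the reduced LSI space
print(concepts)          # e.g. ['engine', 'flower', 'vehicle']
print(doc_vectors.shape) # (3, 2)
```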

Concept Based Search Using LSI and Automatic Keyphrase Extraction

Proceedings of the 2010 3rd International Conference on Emerging Trends in Engineering and Technology, 2010

The classic information retrieval model can lead to poor retrieval: unrelated documents may be included in the answer set, and relevant documents that do not contain at least one index term may be missed. Retrieval based on index terms is vague and noisy, and the user's information need is more related to concepts and ideas than to index terms. The Latent Semantic Indexing (LSI) model is a concept-based retrieval method that overcomes many of the problems evident in today's popular word-based retrieval systems. Most retrieval systems match words in the user's queries with words in the text of documents in the corpus; the LSI model instead performs the match based on concepts. In order to perform this concept mapping, Singular Value Decomposition (SVD) is used. Keyphrases are also an important means of document summarization, clustering, and topic search: they give a high-level description of document contents, which makes it easy for prospective readers to decide whether a document is relevant to them. In this paper, we first develop an automatic keyphrase extraction model for extracting keyphrases from documents and then use these keyphrases as a corpus on which conceptual search is performed using LSI.
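
An illustrative sketch of that pipeline follows, with a deliberately simple frequency-based scorer standing in for the paper's keyphrase extraction model: top-scoring candidate phrases are extracted per document, and the resulting keyphrase-by-document matrix is what LSI (via SVD) is then applied to:

```python
# Sketch: naive keyphrase extraction (frequency-ranked unigrams/bigrams with
# no stop words), then a keyphrase-by-document matrix for the LSI step.
import re
from collections import Counter
import numpy as np

STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on"}

def candidate_phrases(doc):
    """Unigrams and bigrams with no stop words, as keyphrase candidates."""
    tokens = re.findall(r"[a-z]+", doc.lower())
    unigrams = [t for t in tokens if t not in STOP_WORDS]
    bigrams = [f"{a} {b}" for a, b in zip(tokens, tokens[1:])
               if a not in STOP_WORDS and b not in STOP_WORDS]
    return unigrams + bigrams

def top_keyphrases(doc, m=3):
    return [p for p, _ in Counter(candidate_phrases(doc)).most_common(m)]

def keyphrase_document_matrix(docs, m=3):
    per_doc = [top_keyphrases(d, m) for d in docs]
    phrases = sorted({p for kp in per_doc for p in kp})
    A = np.zeros((len(phrases), len(docs)))
    for j, kp in enumerate(per_doc):
        for p in kp:
            A[phrases.index(p), j] = 1.0
    return phrases, A

docs = ["Latent semantic indexing for concept based search",
        "Automatic keyphrase extraction for document summarization"]
phrases, A = keyphrase_document_matrix(docs)
U, s, Vt = np.linalg.svd(A, full_matrices=False)   # conceptual search runs in this space
print(phrases)
```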