Information retrieval using the reduced row echelon form of a term-document matrix

Information Retrieval Using Matrix Methods

Advances in Computer Science Research, 2022

This research is part of data mining, a subfield of information retrieval and text mining. It focuses on an approach to retrieving relevant online news documents using a specific threshold value, and on improving computational performance when retrieving relevant documents from large collections. The author uses news from three news sites that are quite popular in Indonesia and are included in the top 10 of the Alexa Traffic Rank (ATR) 2021, namely tribunnews.com, detik.com, and liputan6.com. To search for relevant news documents, the author first determines the threshold value by calculating the average similarity value of the documents used as the experimental sample; the resulting threshold then determines which document similarity values are accepted. The author uses several techniques to support the research process, such as text preprocessing with the Tala stemming method and news document representation using matrix methods, and finally applies the cosine similarity measure to determine document similarity in matrix-based search. The results indicate that the approach using the matrix method together with the matrix compression process achieves good computational performance, so it will be useful for implementation on large document collections.
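The average-similarity threshold scheme described above can be sketched as follows; the toy term-document matrix, the query, and all variable names are illustrative assumptions, not the paper's corpus or code:

```python
import numpy as np

# Illustrative term-document matrix: rows are terms, columns are documents.
# Neither the data nor the names come from the paper.
A = np.array([
    [2.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
    [0.0, 3.0, 1.0],
    [1.0, 0.0, 2.0],
])

def cosine_similarity(u, v):
    """Cosine of the angle between two term vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Query vector over the same four-term vocabulary.
q = np.array([1.0, 0.0, 0.0, 1.0])

# Similarity of the query to every document column.
sims = [cosine_similarity(q, A[:, j]) for j in range(A.shape[1])]

# Threshold = average similarity over the sample; keep documents at or above it.
threshold = sum(sims) / len(sims)
relevant = [j for j, s in enumerate(sims) if s >= threshold]
```

Documents whose similarity reaches the sample average are treated as relevant, mirroring the paper's use of the average similarity as the cutoff.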

Using Linear Algebra for Intelligent Information Retrieval

Currently, most approaches to retrieving textual materials from scientific databases depend on a lexical match between words in users' requests and those in or assigned to documents in a database. Because of the tremendous diversity in the words people use to describe the same document, lexical methods are necessarily incomplete and imprecise. Using the singular value decomposition (SVD), one can take advantage of the implicit higher-order structure in the association of terms with documents by determining the SVD of large sparse term-by-document matrices. Terms and documents represented by 200-300 of the largest singular vectors are then matched against user queries. We call this retrieval method Latent Semantic Indexing (LSI) because the subspace represents important associative relationships between terms and documents that are not evident in individual documents. LSI is a completely automatic yet intelligent indexing method, widely applicable, and a promising way to improve users' access to many kinds of textual materials, or to documents and services for which textual descriptions are available. A survey of the computational requirements for managing LSI-encoded databases as well as current and future applications of LSI is presented.
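A minimal sketch of the SVD step behind LSI, assuming a toy 5-term by 4-document matrix; real systems keep 200-300 singular vectors of a much larger sparse matrix:

```python
import numpy as np

# Toy term-document matrix (5 terms x 4 documents); values are illustrative.
A = np.array([
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 0.0, 1.0],
])

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2  # keep only the k largest singular vectors

term_vecs = U[:, :k]     # each term as a k-dimensional vector
doc_vecs = Vt[:k, :].T   # each document as a k-dimensional vector

# Rank-k approximation of the original matrix from the kept factors.
A_k = U[:, :k] * s[:k] @ Vt[:k, :]
```

Similarity between a query and a document is then computed in the k-dimensional subspace rather than over raw term overlap, which is how LSI captures associations not evident in individual documents.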

A Comparison of SVD, SVR, ADE and IRR for Latent Semantic Indexing

Communications in Computer and Information Science, 2009

Recently, singular value decomposition (SVD) and its variants, namely singular value rescaling (SVR), approximation dimension equalization (ADE) and iterative residual rescaling (IRR), were proposed to perform latent semantic indexing (LSI). Although they are all based on the same linear algebraic method for term-document matrix computation, the SVD, the motivations behind them concerning LSI differ from each other. In this paper, a series of experiments examines their effectiveness for LSI in practical text mining applications, including information retrieval, text categorization and similarity measurement. The experimental results demonstrate that SVD and SVR perform better than the other proposed LSI methods in the above applications. Meanwhile, ADE and IRR, because of the large difference in Frobenius norm between their approximation matrices and the original term-document matrix, cannot achieve good performance in text mining applications using LSI.
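The abstract attributes ADE's and IRR's weaker results to the Frobenius-norm gap between their approximations and the original term-document matrix. As a point of reference, the rank-k SVD truncation minimizes exactly this gap (the Eckart-Young theorem); a sketch with an illustrative random matrix:

```python
import numpy as np

# Illustrative dense matrix standing in for a term-document matrix.
rng = np.random.default_rng(0)
A = rng.random((20, 10))

U, s, Vt = np.linalg.svd(A, full_matrices=False)

def rank_k_error(k):
    """Frobenius-norm distance between A and its rank-k SVD truncation."""
    A_k = U[:, :k] * s[:k] @ Vt[:k, :]
    return float(np.linalg.norm(A - A_k, "fro"))

# By Eckart-Young the rank-k error equals sqrt(sum of the discarded s_i^2),
# and it shrinks monotonically as k grows.
errors = [rank_k_error(k) for k in range(1, 11)]
```

Any other rank-k approximation, including rescaled variants, can only match or exceed these error values, which is the yardstick the abstract's Frobenius-norm argument appeals to.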

Information retrieval performance enhancement using the average standard estimator and the multi-criteria decision weighted set of performance measures

2008

Large-scale information retrieval is much more challenging than retrieval over traditional small document collections. The main difference is the importance of correlations between related concepts in complex data structures. These structures have been studied by several information retrieval systems. This research began with a comprehensive review and comparison of several techniques for matrix dimensionality estimation and their respective effects on retrieval performance using singular value decomposition and latent semantic analysis. Two novel techniques are introduced in this research to enhance intrinsic dimensionality estimation: the Multi-criteria Decision Weighted model, which estimates matrix intrinsic dimensionality for large document collections, and the Average Standard Estimator (ASE), which estimates data intrinsic dimensionality based on the singular value decomposition (SVD). ASE estimates the level of significance for singular values resulting from the singular value...
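The abstract does not give ASE's formula, so the rule below (counting singular values above their mean) is only a hypothetical stand-in, illustrating how a spectrum-based estimator can pick a cutoff dimension from the singular values:

```python
import numpy as np

# Illustrative singular value spectrum: three dominant values, then a tail.
# The "keep values above the mean" rule is an assumption for illustration,
# not the paper's actual Average Standard Estimator.
s = np.array([10.0, 8.0, 6.0, 0.5, 0.4, 0.3, 0.2, 0.1])

# Estimated intrinsic dimensionality: number of singular values whose
# magnitude exceeds the average of the whole spectrum.
k_est = int(np.sum(s > s.mean()))
```

On this spectrum the mean is about 3.19, so the rule keeps the three dominant values, matching the visible gap in the spectrum.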

Information Retrieval using Cosine and Jaccard Similarity Measures in Vector Space Model

International Journal of Computer Applications, 2017

With the exponential growth of documents available to us on the web, the requirement for an effective technique to retrieve the most relevant document matching a given search query has become critical. The field of Information Retrieval deals with the problem of document similarity to retrieve desired information from a large amount of data. Various models and similarity measures have been proposed to determine the extent of similarity between two objects. The objective of this paper is to summarize the entire process, looking into some of the most well-known algorithms and approaches to match a query text against a set of indexed documents.
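The two measures compared in the paper can be sketched as follows; the toy documents are illustrative, with cosine applied to weighted term-count vectors and Jaccard to the binary term sets:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two term-count vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def jaccard(a, b):
    """Jaccard similarity between two sets of terms."""
    return len(a & b) / len(a | b)

# Two illustrative documents over a shared vocabulary.
doc1 = "the cat sat on the mat".split()
doc2 = "the cat lay on the rug".split()

vocab = sorted(set(doc1) | set(doc2))
v1 = np.array([doc1.count(t) for t in vocab], dtype=float)
v2 = np.array([doc2.count(t) for t in vocab], dtype=float)

cos_sim = cosine(v1, v2)                    # uses term frequencies
jac_sim = jaccard(set(doc1), set(doc2))     # uses term presence only
```

Because cosine weights repeated terms (here "the" appears twice in each document), the two measures generally disagree: for this pair cosine gives 0.75 while Jaccard gives 3/7.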

Information Retrieval Using a Singular Value Decomposition Model of Latent Semantic Structure

Proceedings of the …, 1988

In a new method for automatic indexing and retrieval, implicit higher-order structure in the association of terms with documents is modeled to improve estimates of term-document association, and therefore the detection of relevant documents on the basis of terms found in queries. Singular value decomposition is used to decompose a large term-by-document matrix into 50 to 150 orthogonal factors from which the original matrix can be approximated by linear combination; both documents and terms are represented as vectors in a 50- to 150-dimensional space. Queries are represented as pseudo-document vectors formed from weighted combinations of terms, and documents are ordered by their similarity to the query. Initial tests find this automatic method very promising.
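The query fold-in described above (a pseudo-document formed from the query's terms, then compared to documents by similarity) can be sketched as follows, assuming documents are represented by the rows of V_k; the toy matrix and query are illustrative:

```python
import numpy as np

# Toy term-document matrix (5 terms x 4 documents); values are illustrative.
A = np.array([
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 0.0, 1.0],
])

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
doc_vecs = Vt[:k, :].T                    # documents in the latent space

# Query over terms 0 and 1, folded in as a pseudo-document:
# q_hat = q^T U_k S_k^{-1}
q = np.array([1.0, 1.0, 0.0, 0.0, 0.0])
q_hat = q @ U[:, :k] / s[:k]

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Order documents by similarity to the pseudo-document.
ranking = sorted(range(4), key=lambda j: cos(q_hat, doc_vecs[j]), reverse=True)
```

Here the query's term profile matches document 0 exactly, so its pseudo-document coincides with that document's latent coordinates and document 0 ranks first.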

Applications of Linear Algebra to Information Retrieval

2009

Some of the theory of nonnegative matrices is first presented. The Perron-Frobenius theorem is highlighted. Some of the important linear algebraic methods of information retrieval are surveyed. Latent Semantic Indexing (LSI), which uses the singular value decomposition, is discussed. The Hyper-Text Induced Topic Search (HITS) algorithm is next considered; here the power method for finding dominant eigenvectors is employed. Through the use of a theorem by Sinkhorn and Knopp, a modified HITS method is developed. Lastly, the PageRank algorithm is discussed. Numerical examples and MATLAB programs are also provided.
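The power method mentioned above, the core iteration behind HITS and PageRank, can be sketched as follows; the 4-page link matrix is an illustrative example, written in Python rather than the MATLAB the paper provides:

```python
import numpy as np

def power_method(M, iters=1000):
    """Iterate x <- Mx with L1 normalization to find the dominant eigenvector."""
    x = np.ones(M.shape[0]) / M.shape[0]
    for _ in range(iters):
        x = M @ x
        x = x / np.linalg.norm(x, 1)
    return x

# Column-stochastic link matrix of a tiny 4-page web:
# entry (i, j) is the probability of following a link from page j to page i.
M = np.array([
    [0.0, 0.5, 0.0, 0.0],
    [1/3, 0.0, 0.0, 0.5],
    [1/3, 0.0, 0.0, 0.5],
    [1/3, 0.5, 1.0, 0.0],
])

rank = power_method(M)  # stationary importance scores, summing to 1
```

Because this toy graph is strongly connected and aperiodic, the iteration converges to the unique positive eigenvector for eigenvalue 1, which is the PageRank-style score vector; real PageRank additionally mixes in a damping factor to guarantee these properties on arbitrary graphs.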

On the Use of the Singular Value Decomposition for Text Retrieval

Computational information …, 2001

Parry Husbands, Horst Simon, and Chris H. Q. Ding. 1 Introduction. The use of the Singular Value Decomposition (SVD) has been proposed for text retrieval in several recent works [2, 6]. This technique ...

Similarity Measure Algorithm for Text Document Clustering, Using Singular Value Decomposition

Current Journal of Applied Science and Technology

We examined a similarity measure for text document clustering. Data mining is a challenging field with many research and application areas. Text document clustering, a subset of data mining, helps group and organize a large quantity of unstructured text documents into a small number of meaningful clusters. An algorithm that calculates the degree of closeness between documents using their document matrix was used to query the terms/words in each document. We also determined whether a given set of text documents is similar or different from the others when these terms are queried. We found that the ability to rank and approximate documents using a matrix allows the use of Singular Value Decomposition (SVD) as an enhanced text data mining algorithm. Moreover, applying SVD to a high-dimensional matrix yields a matrix of lower dimension, exposing the relationships in the original matrix by ordering it from the most variant to the least.

Comparison of VSM, GVSM, and Lsi in Information Retrieval for Indonesian Text

Jurnal Teknologi, 2016

The vector space model (VSM) is an Information Retrieval (IR) system model that represents queries and documents as n-dimensional vectors. GVSM is an extension of VSM that represents documents based on the similarity value between the query and the minterm vector space of the document collection, where the minterm vectors are defined by the terms in the query; retrieval can therefore be performed based on the meaning of the words in the query. By contrast, documents can carry the same information semantically. LSI is a method implemented in IR systems to retrieve documents based on the overall meaning of the user's query within a document, not on a word-by-word translation. LSI uses a matrix algebra technique, namely the Singular Value Decomposition (SVD). This study discusses the performance of VSM, GVSM and LSI implemented in an IR system to retrieve Indonesian-language sentence documents of .pdf, .doc and .docx file types, using the Nazief and Adriani stemming algorithm. Each method is implemented either by thread or no-...