Two uses for updating the partial singular value decomposition in latent semantic indexing
2008, Applied Numerical Mathematics
Latent Semantic Indexing (LSI) is an information retrieval (IR) method that connects IR with numerical linear algebra by representing a dataset as a term-document matrix. Because of the tremendous size of modern databases, such matrices can be extremely large. The partial singular value decomposition (PSVD) is a matrix factorization that captures the salient features of a matrix while using much less storage. We look at two challenges posed by this PSVD data compression process in LSI. First, we note that traditional methods of computing the PSVD are very expensive; most of the processing time in LSI is spent calculating the PSVD of the term-document matrix. In a rapidly expanding environment such as the Internet, the term-document matrix changes often as new documents and terms are added. Updating the PSVD of this matrix is much more efficient than recalculating it after each change. Thus, the first challenge is efficiently updating the PSVD when the matrix is altered slightly. The second challenge is calculating the PSVD efficiently in terms of computational and memory requirements. We investigate the use of the PSVD updating methods proposed by Zha and Simon [H. Zha, H.D. Simon, On updating problems in latent semantic indexing, SIAM J. Sci. Comput. 21 (2) (1999) 782-791] to meet both of these challenges. Results are presented illustrating that updating in this manner provides substantial savings in computation time, with no significant reduction in accuracy. An algorithm for iteratively computing the PSVD of a matrix using the document updating method is also presented. This iterative method, suggested by Zha and Zhang [H. Zha, Z. Zhang, Matrices with low-rank-plus-shift structure: partial SVD and latent semantic indexing, SIAM J. Matrix Anal. Appl. 21 (2) (1999) 522-536], provides a means of calculating the PSVD for matrices so large that the computation would be infeasible using traditional methods. Again, results are given showing that this method can provide savings in memory resources and computational time without compromising the accuracy of the results.
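To make the two ideas concrete, the following is a minimal NumPy sketch of a document (column) update in the style commonly attributed to Zha and Simon, followed by a block-by-block loop that builds a PSVD from successive column blocks. It is an illustration under stated assumptions, not the paper's implementation: the function names (`truncated_svd`, `update_documents`, `blockwise_psvd`), the dense-matrix setting, and the fixed block size are all hypothetical choices made here, and a practical LSI code would use sparse matrices and a Lanczos-type solver for the initial factorization.

```python
import numpy as np


def truncated_svd(A, k):
    """Rank-k PSVD of A computed directly (dense reference method)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :].T


def update_documents(U, s, V, D):
    """Update a rank-k PSVD  A_k = U diag(s) V^T  after appending columns D.

    Returns the rank-k PSVD of [A_k, D] without refactoring the full matrix.
    """
    k = s.shape[0]
    UtD = U.T @ D                       # component of D inside span(U)
    Q, R = np.linalg.qr(D - U @ UtD)    # orthonormal basis for the remainder
    r = R.shape[0]
    # Small (k + r) x (k + p) core matrix; [U, Q] @ B equals [A_k V, D]
    # expressed in the enlarged bases.
    B = np.block([[np.diag(s), UtD],
                  [np.zeros((r, k)), R]])
    Ub, sb, Vbt = np.linalg.svd(B, full_matrices=False)
    Vb = Vbt.T
    U_new = np.hstack([U, Q]) @ Ub[:, :k]
    # Apply blockdiag(V, I_p) to the leading k right singular vectors of B.
    V_new = np.vstack([V @ Vb[:k, :k], Vb[k:, :k]])
    return U_new, sb[:k], V_new


def blockwise_psvd(A, k, block):
    """Approximate rank-k PSVD of A built from successive column blocks.

    Assumes block >= k so the first block can supply k singular triplets.
    Only one block of columns is factored or folded in at a time.
    """
    U, s, V = truncated_svd(A[:, :block], k)
    for start in range(block, A.shape[1], block):
        U, s, V = update_documents(U, s, V, A[:, start:start + block])
    return U, s, V
```

Each update costs an SVD of a matrix whose size depends only on k and the number of new documents, which is where the savings over recomputation would come from. Note that the block-by-block result equals the true PSVD only when the discarded singular directions carry no information (for example, an exactly rank-k matrix); in general it is an approximation whose quality depends on the spectrum of the term-document matrix.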