Fast and Intuitive Clustering of Web Documents (original) (raw)
Conventional document retrieval systems (e.g., Alta Vista) return long lists of ranked documents in response to user queries. Recently, document clustering has been put forth as an alternative method of organizing the results of a retrieval 6]. A person browsing the clusters can discover patterns that would be overlooked in the traditional ranked-list presentation. In this context, a document clustering algorithm has two key requirements. First, the algorithm ought to produce clusters that are easy-to-browse { a user needs to determine at a glance whether the contents of a cluster are of interest. Second, the algorithm has to be fast even when applied to thousands of documents with no preprocessing. This paper describes several novel clustering methods, which intersect the documents in a cluster to determine the set of words (or phrases) shared by all the documents in the cluster. We report on experiments that evaluate these intersection-based clustering methods on collections of snippets returned from Web search engines. First, we show that word-intersection clustering produces superior clusters and does so faster than standard techniques. Second, we show that our O(n log n) time phrase-intersection clustering method produces comparable clusters and does so more than two orders of magnitude faster than word-intersection.
Sign up for access to the world's latest research.
checkGet notified about relevant papers
checkSave papers to use in your research
checkJoin the discussion with peers
checkTrack your impact
Loading Preview
Sorry, preview is currently unavailable. You can download the paper by clicking the button above.