Human Performance on Clustering Web Pages

With the increase in information on the World Wide Web, it has become difficult to find desired information quickly without issuing multiple queries or using a topic-specific search engine. One way to help the search is to group together HTML pages that appear in some way to be related. To better understand this task, we performed an initial study of human clustering of web pages, in the hope that it would provide some insight into the difficulty of automating the task. Our results show that subjects did not cluster identically; in fact, on average, any two subjects had little similarity in their web-page clusters. We also found that subjects generally created rather small clusters, and that those with access only to URLs created fewer clusters than those with access to the full text of each web page. In general, both the overlap of documents between a given subject's clusters and the percentage of documents clustered increased when subjects were given the full text. When analyzing individual subjects, we found that each behaved differently across queries in terms of overlap, cluster size, and number of clusters. These results provide a sobering note on any quest for a single, clearly correct clustering method for web pages.¹

¹ A slightly condensed version of this paper was published in .

[...] centroids and clusters based on similarity to those centroids. A recent HAC-based method, Word-Intersection Clustering, clusters based on phrases and allows for overlapping clusters. Another interactive approach, Scatter/Gather [3], lets the user navigate through the retrieved results and clusters dynamically based on this navigation. A K-means method is used to cluster documents and find important words for each of those clusters.
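
To make the K-means idea above concrete, the following is a minimal sketch of clustering documents around centroids and reading off the highest-weighted words of each centroid. It is not the Scatter/Gather implementation. The function names (`tf_vector`, `kmeans`, `top_words`), the raw term-frequency weighting, the cosine similarity measure, and the choice to seed the centroids with the first k documents are all assumptions made for illustration.

```python
from collections import Counter
import math


def tf_vector(text):
    """Bag-of-words term-frequency vector for one document."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse vectors (Counter or dict)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def centroid(vectors):
    """Mean of a non-empty list of sparse vectors."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return {t: w / len(vectors) for t, w in total.items()}


def kmeans(docs, k, iterations=10):
    """Cluster docs into k groups; returns (assignments, centroids)."""
    vectors = [tf_vector(d) for d in docs]
    # Assumption: seed with the first k documents so runs are reproducible.
    centroids = [dict(v) for v in vectors[:k]]
    assignments = [-1] * len(vectors)
    for _ in range(iterations):
        # Assign each document to its most similar centroid.
        new = [max(range(k), key=lambda c: cosine(v, centroids[c]))
               for v in vectors]
        if new == assignments:
            break  # converged: no document changed cluster
        assignments = new
        # Recompute each centroid as the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignments) if a == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = centroid(members)
    return assignments, centroids


def top_words(centroid_vec, n=3):
    """The n highest-weighted terms of a centroid (ties broken alphabetically)."""
    ranked = sorted(centroid_vec.items(), key=lambda x: (-x[1], x[0]))
    return [t for t, _ in ranked[:n]]
```

Note that this sketch places each document in exactly one cluster, whereas the subjects in our study were free to put a document in several clusters or in none.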