Rohit J V | National Institute of Technology, Tiruchirappalli


Papers by Rohit J V

TOPIC MODELING: CLUSTERING OF DEEP WEBPAGES

The internet comprises a massive amount of information in the form of zillions of web pages. This information can be categorized into the surface web and the deep web. Existing search engines can effectively make use of surface web information, but the deep web remains unexploited. Machine learning techniques have commonly been employed to access deep web content.

Under machine learning, topic models provide a simple way to analyze large volumes of unlabeled text. A "topic" consists of a cluster of words that frequently occur together. Using contextual clues, topic models can connect words with similar meanings and distinguish between words with multiple meanings. Clustering is one of the key solutions to organizing deep web databases. In this paper, we cluster deep web databases based on the relevance found among deep web forms by employing a generative probabilistic model called Latent Dirichlet Allocation (LDA) to model content representative of deep web databases. This is implemented after preprocessing the set of web pages to extract page contents and form contents. Further, we derive the distributions of topics per document and words per topic using Gibbs sampling. Experimental results show that the proposed method clearly outperforms existing clustering methods.
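The abstract's core machinery (LDA fit by Gibbs sampling, yielding "topics per document" and "words per topic" distributions) can be illustrated with a minimal collapsed Gibbs sampler. This is a generic sketch of the technique, not the paper's implementation; the function name, hyperparameters, and toy corpus below are illustrative assumptions.

```python
import numpy as np

def lda_gibbs(docs, n_topics, vocab_size, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Minimal collapsed Gibbs sampler for LDA (illustrative sketch).
    docs: list of documents, each a list of integer word ids."""
    rng = np.random.default_rng(seed)
    ndk = np.zeros((len(docs), n_topics))   # topic counts per document
    nkw = np.zeros((n_topics, vocab_size))  # word counts per topic
    nk = np.zeros(n_topics)                 # total words per topic
    z = []                                  # topic assignment per token
    # Random initialization of topic assignments
    for d, doc in enumerate(docs):
        zd = []
        for w in doc:
            k = rng.integers(n_topics)
            zd.append(k)
            ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
        z.append(zd)
    # Gibbs sweeps: resample each token's topic from its full conditional
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + vocab_size * beta)
                k = rng.choice(n_topics, p=p / p.sum())
                z[d][i] = k
                ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    # Posterior estimates: theta = topics per document, phi = words per topic
    theta = (ndk + alpha) / (ndk + alpha).sum(axis=1, keepdims=True)
    phi = (nkw + beta) / (nkw + beta).sum(axis=1, keepdims=True)
    return theta, phi
```

For clustering, each document (here, a preprocessed deep web form) would then be assigned to its dominant topic, e.g. `theta.argmax(axis=1)`.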

