The algorithm in this paper simulates a user's visit to a web page and the way the user locates the position of the main content block on the page. The proposed method is tag-independent and accomplishes the extraction job in two phases. First, it transforms the DOM tree obtained from the input HTML detailed web page into a block tree, based on the nodes' visual representation and DOM structure, such that every node carries a specification vector; it then traverses the resulting, much smaller block tree to find the main block, i.e., the node whose computed value dominates those of the other block nodes according to its specification-vector values. The introduced method has no learning phase and can find the informative content of any detailed web page given as input. The method has been tested on a large variety of websites and, as we will show, it achieves better precision and recall than the compared method, K-FE.
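Although the paper's exact specification vector is not reproduced here, the two-phase idea can be illustrated with a minimal sketch: group the DOM into candidate blocks, attach a small feature vector to each, and pick the block whose score dominates. Text length and link density are assumed stand-in features; the `BLOCK_TAGS` list and the scoring rule below are illustrative, not the paper's definitions.

```python
# Minimal sketch of the two-phase block-tree idea: build candidate blocks,
# attach a (hypothetical) specification vector, and pick the dominant one.
from bs4 import BeautifulSoup

BLOCK_TAGS = ["div", "article", "section", "td", "main"]  # assumed block tags

def spec_vector(node):
    """Hypothetical specification vector: (text length, link density)."""
    text = node.get_text(" ", strip=True)
    link_text = sum(len(a.get_text(" ", strip=True)) for a in node.find_all("a"))
    return len(text), (link_text / len(text) if text else 1.0)

def main_block(html):
    soup = BeautifulSoup(html, "html.parser")
    blocks = soup.find_all(BLOCK_TAGS) or [soup]
    # Dominant block: lots of text, little of it inside links (assumed rule).
    def score(block):
        length, density = spec_vector(block)
        return length * (1.0 - density)
    return max(blocks, key=score)
```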
Analysis of DOM Based Automatic Web Content Extraction
2013
The World Wide Web plays an important role when searching for information in the data network. This paper deals with research in the area of automatic extraction of textual and non-textual information. The developed method covers two types of data extraction, namely image and text extraction, performed using the concepts of the Document Object Model (DOM) tree. The paper presents a series of data filters to detect and remove irrelevant data from the web page. Many web applications adopt AJAX to enhance their user experience, but AJAX has a number of properties that make it extremely difficult for traditional search engines to crawl. The paper therefore proposes an AJAX crawling scheme based on the DOM and a breadth-first AJAX crawling algorithm. Keywords: DOM, image extraction, content extraction.
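The breadth-first AJAX crawling idea can be sketched generically. Executing JavaScript requires a browser engine (e.g., a headless browser), which is abstracted away here: `find_events` and `fire_event` are hypothetical hooks that enumerate clickable elements in a DOM state and return the state reached after an event fires.

```python
# A minimal breadth-first exploration of AJAX states, in the spirit of the
# crawling scheme described above. The two hooks stand in for real browser
# machinery and are assumptions, not the paper's interfaces.
from collections import deque

def bfs_crawl(seed_state, find_events, fire_event, max_states=100):
    """Breadth-first exploration of AJAX states reachable from seed_state."""
    seen = {seed_state}              # states identified e.g. by a DOM hash
    queue = deque([seed_state])
    while queue and len(seen) < max_states:
        state = queue.popleft()
        for event in find_events(state):          # e.g. clickable elements
            new_state = fire_event(state, event)  # DOM after the event runs
            if new_state not in seen:
                seen.add(new_state)
                queue.append(new_state)
    return seen
```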
Extracting context to improve accuracy for HTML content extraction
Special interest tracks and posters of the 14th international conference on World Wide Web - WWW '05, 2005
Web pages contain clutter (such as ads, unnecessary images and extraneous links) around the body of an article, which distracts a user from the actual content. Extraction of "useful and relevant" content from web pages has many applications, including cell phone and PDA browsing, speech rendering for the visually impaired, reducing noise for information retrieval systems, and generally improving the web browsing experience. In our previous work [16], we developed a framework that employed an easily extensible set of techniques incorporating results from our earlier work on content extraction. Our insight was to work with DOM trees rather than raw HTML markup. We present here filters that reduce human involvement in applying heuristic settings for websites and instead automate the job by detecting and utilizing the physical layout and content genre of a given website. We also present work we have done towards improving the usability and performance of our content extraction proxy, as well as the quality and accuracy of the heuristics that act as filters for inferring the context of a webpage.
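As a rough illustration of DOM-tree filtering (not the paper's actual filter set), the sketch below drops tags that are usually clutter and prunes link-heavy blocks; the tag list and the link-density threshold are assumptions.

```python
# Rough sketch of DOM filtering: drop clutter tags, then prune blocks whose
# text is mostly links. Tag list and threshold are illustrative assumptions.
from bs4 import BeautifulSoup

NOISE_TAGS = ["script", "style", "iframe", "nav", "aside"]  # assumed clutter

def filter_page(html, max_link_density=0.5):
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.find_all(NOISE_TAGS):
        tag.decompose()                          # remove obvious clutter
    for div in soup.find_all("div"):
        if div.decomposed:                       # parent already removed
            continue
        text = div.get_text(" ", strip=True)
        links = "".join(a.get_text(" ", strip=True) for a in div.find_all("a"))
        if text and len(links) / len(text) > max_link_density:
            div.decompose()                      # link-heavy: likely navigation
    return soup
```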
Main Content Extraction from Heterogeneous Webpages
2018
Besides the main content, webpages often contain other complementary and noisy data such as advertisements, navigational information, copyright notices, and other template-related elements. The detection and extraction of main content can have many applications, such as web summarization, indexing, data mining, content adaptation to mobile devices, web content printing, etc. We introduce a novel site-level technique for content extraction based on the DOM representation of webpages. This technique analyzes some selected pages in any given website to identify those nodes in the DOM tree that do not belong to the webpage template. Then, an algorithm explores these nodes in order to select the main content nodes. To properly evaluate the technique, we have built a suite of benchmarks by downloading several heterogeneous real websites and manually marking the main content nodes. This suite of benchmarks can be used to evaluate and compare different content extraction techniques.
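The site-level intuition can be sketched as follows: nodes whose fingerprint recurs across several sampled pages of the same site are likely template, and the remaining nodes are candidate main content. The (tag, text-prefix) fingerprint and the recurrence threshold are assumptions for illustration, not the paper's algorithm.

```python
# Sketch of site-level template detection: fingerprints shared by most
# sampled pages are treated as template; everything else is a candidate
# main-content node. Fingerprint and threshold are assumed, not the paper's.
from bs4 import BeautifulSoup
from collections import Counter

def fingerprints(html):
    soup = BeautifulSoup(html, "html.parser")
    return {(n.name, n.get_text(" ", strip=True)[:80]) for n in soup.find_all(True)}

def non_template_nodes(target_html, sampled_htmls, threshold=0.6):
    counts = Counter(fp for page in sampled_htmls for fp in fingerprints(page))
    template = {fp for fp, c in counts.items()
                if c / len(sampled_htmls) >= threshold}
    soup = BeautifulSoup(target_html, "html.parser")
    return [n for n in soup.find_all(True)
            if (n.name, n.get_text(" ", strip=True)[:80]) not in template]
```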
Extracting Content Blocks from Web Pages
Users search for required information using search engines. Search engines crawl and index web pages according to their informative content. Users are interested only in the informative content, not in the non-informative blocks. Web pages often contain navigation sidebars, advertisements, search blocks, copyright notices, etc., which are not content blocks. The information contained in these non-content blocks can harm web mining, so it is important to separate the informative primary content blocks from the non-informative blocks. This paper discusses three algorithms, developed by different authors, for separating content blocks from non-content blocks. Removing non-informative blocks from web pages can achieve significant savings in storage and time.
Methods For Extracting Content Blocks From Web Pages
The Web is perhaps the single largest data source in the world, and the coverage of Web information is very wide and diverse. It contains the information required by the user, i.e., the content blocks of the pages, while the remaining irrelevant information, such as banner ads, navigation bars, and copyright notices, is termed non-content information or blocks. Web mining aims to extract and mine useful knowledge from the Web, but non-content blocks harm web mining. To improve web mining, it is therefore necessary to differentiate content blocks from non-content blocks and to eliminate the latter from web pages. To perform this task, this paper discusses techniques and methods that ultimately provide significant savings in storage and time by delivering only the content blocks of web pages to the user.
Optimized Content Extraction from web pages using Composite Approaches
The information available on the web today is tremendous and comes with great challenges. Content extraction identifies the main content and removes the clutter from web pages. The main problems in extracting content from a web page are the newer architectures of web pages and the diversity of their structures. Optimized content extraction from HTML documents using collective approaches proposes a hybrid model that operates on the Document Object Model (DOM) tree of the corresponding HTML document to extract the content accurately. It combines approaches and techniques such as statistical feature extraction and formatting characteristics. Content-type identification is used along with the collective approach to overcome the problem of dealing with versatile web pages and to achieve greater accuracy in extracting the content.
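A composite pass of this kind might look like the sketch below, which multiplies a statistical feature (text length) by a formatting weight and applies a crude content-type check first; the weights and the check are illustrative assumptions, not the paper's tuned model.

```python
# Sketch of a composite scoring pass: statistical feature x formatting
# weight, gated by a crude content-type check. All constants are assumed.
from bs4 import BeautifulSoup

FORMAT_WEIGHT = {"p": 2.0, "article": 3.0, "div": 1.0, "span": 0.5}  # assumed

def composite_score(node):
    text = node.get_text(" ", strip=True)
    stat = len(text)                              # statistical feature
    fmt = FORMAT_WEIGHT.get(node.name, 1.0)       # formatting characteristic
    return stat * fmt

def extract_main(html):
    soup = BeautifulSoup(html, "html.parser")
    # Content-type identification (assumed): skip pages that are mostly images.
    if len(soup.find_all("img")) > len(soup.find_all("p")) * 3:
        return None
    candidates = soup.find_all(["div", "article", "section"]) or [soup]
    return max(candidates, key=composite_score)
```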