Using the words/leafs ratio in the DOM tree for content extraction

Using the DOM Tree for Content Extraction

Electronic Proceedings in Theoretical Computer Science, 2012

The main information of a webpage is usually mixed with menus, advertisements, panels, and other not necessarily related information, and it is often difficult to isolate this information automatically. This is precisely the objective of content extraction, a research area of wide interest due to its many applications. Content extraction is useful not only for the final human user; it is also frequently used as a preprocessing stage of systems that need to extract the main content of a web document to avoid treating and processing useless information. Another interesting application of content extraction is displaying webpages on small screens such as mobile phones or PDAs. In this work we present a new technique for content extraction that uses the DOM tree of the webpage to analyze the hierarchical relations of its elements. Thanks to this information, the technique achieves considerable recall and precision. Using the DOM structure for content extraction gives us the benefits of other approaches based on the syntax of the webpage (such as characters, words, and tags), but it also gives us very precise information about the related components in a block, thus producing very cohesive blocks.
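One concrete instantiation of this DOM-based analysis is the words/leafs ratio named in the section title: subtrees that are dense in words relative to their number of leaf nodes tend to be content, while menus and link panels have many leaves but few words. The Python sketch below illustrates that idea over a simplified DOM-like tree; the Node class and the selection rule are illustrative assumptions, not the authors' exact algorithm.

```python
# Minimal sketch of a words/leaves heuristic over a DOM-like tree.
# The Node representation and selection rule are assumptions.

from dataclasses import dataclass, field

@dataclass
class Node:
    tag: str
    text: str = ""
    children: list = field(default_factory=list)

def words_and_leaves(node):
    """Return (word_count, leaf_count) for the subtree rooted at node."""
    if not node.children:
        return len(node.text.split()), 1
    words, leaves = len(node.text.split()), 0
    for child in node.children:
        w, l = words_and_leaves(child)
        words += w
        leaves += l
    return words, leaves

def best_block(root):
    """Pick the subtree with the highest words-to-leaves ratio."""
    best, best_ratio = root, 0.0
    stack = [root]
    while stack:
        current = stack.pop()
        words, leaves = words_and_leaves(current)
        if words / leaves > best_ratio:
            best, best_ratio = current, words / leaves
        stack.extend(current.children)
    return best

# Toy page: a nav menu (many leaves, few words) vs. an article body.
page = Node("body", children=[
    Node("nav", children=[Node("a", "Home"), Node("a", "About"), Node("a", "Contact")]),
    Node("article", children=[
        Node("p", "Content extraction isolates the main text of a webpage."),
        Node("p", "Menus and ads contribute many leaves but few words."),
    ]),
])

print(best_block(page).tag)  # -> article
```

On this toy tree the nav subtree scores one word per leaf while the article scores nine, so the article is selected; the hierarchical grouping is what keeps the returned block cohesive.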

Main Content Extraction from Detailed Web Pages

International Journal of Computer Applications, 2010

As we know, detailed web pages on the Internet contain information that is not considered primary content, such as advertisements, headers, footers, navigation links, and copyright information. Information such as comments and reviews is also not preferred by search engines for indexing as informative content, so an algorithm that extracts only the main content could improve the quality of web page indexing. Almost all previously proposed algorithms are tag dependent, meaning they can only look for primary content within specific tags. The algorithm in this paper simulates a user's visit to a web page and how the user finds the position of the main content block in the page. The proposed method is tag independent and accomplishes the extraction in two phases. First, it transforms the DOM tree obtained from the input HTML page into a block tree, based on the visual representation and DOM structure, such that every node carries a specification vector; then it traverses the resulting small block tree to find the main block, the one whose computed value dominates those of the other block nodes according to its specification vector. The introduced method does not require any learning phase and can find the informative content of any random detailed web page. The method has been tested on a large variety of websites and, as we will show, it achieves better precision and recall than the compared method K-FE.
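As a rough illustration of this block-tree scoring, the sketch below attaches a small specification vector to each block (here just word and link-word counts, an assumed simplification of the paper's vector) and returns the block whose computed value dominates the others.

```python
# Hedged sketch of dominant-block selection over a block tree.
# The features and the scoring formula are assumptions for illustration.

from dataclasses import dataclass, field

@dataclass
class Block:
    name: str
    text_words: int      # words of plain text in the block
    link_words: int      # words inside anchor tags
    children: list = field(default_factory=list)

def score(block):
    """Favor text-heavy, link-light blocks (assumed weighting)."""
    total = block.text_words + block.link_words
    if total == 0:
        return 0.0
    link_density = block.link_words / total
    return block.text_words * (1.0 - link_density)

def dominant_block(root):
    """Traverse the block tree and return the highest-scoring block."""
    best, best_score = root, score(root)
    stack = list(root.children)
    while stack:
        b = stack.pop()
        if score(b) > best_score:
            best, best_score = b, score(b)
        stack.extend(b.children)
    return best

page = Block("page", 0, 0, children=[
    Block("sidebar", 20, 80),   # mostly navigation links
    Block("article", 450, 10),  # dense running text
    Block("footer", 15, 30),
])
print(dominant_block(page).name)  # -> article
```

Because the decision is made on computed block statistics rather than on particular tag names, a traversal like this stays tag independent, which is the property the abstract emphasizes.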

Analysis of DOM Based Automatic Web Content Extraction

2013

The World Wide Web plays an important role in searching for information in the data network. This paper deals with research in the area of automatic extraction of textual and non-textual information. The developed method covers two types of data extraction, i.e., image and text extraction, performed using the concepts of the Document Object Model (DOM) tree. The paper presents a series of data filters to detect and remove irrelevant data from the web page. Many web applications adopt AJAX to enhance their user experience, but AJAX has a number of properties that make it extremely difficult for traditional search engines to crawl. The paper therefore proposes an AJAX crawling scheme based on the DOM and a breadth-first AJAX crawling algorithm. Keywords: DOM, image extraction, content extraction.
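The series of data filters can be pictured as a chain of passes over the DOM tree, each removing one class of irrelevant nodes while text and image data survive. The sketch below is a hedged illustration using Python's standard ElementTree on a well-formed snippet; the targeted tags and class names are assumptions, not the paper's actual filter list.

```python
# Illustrative filter chain over a DOM tree: each pass removes one kind
# of irrelevant node. Tag and class blacklists are assumed examples.

import xml.etree.ElementTree as ET

HTML = """<html><body>
<script>track();</script>
<div class="ad">Buy now!</div>
<div class="content"><p>Actual article text.</p>
<img src="figure1.png" alt="Figure 1"/></div>
</body></html>"""

def drop_tags(root, tags):
    """Filter pass 1: remove every element whose tag is blacklisted."""
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag in tags:
                parent.remove(child)

def drop_classes(root, classes):
    """Filter pass 2: remove elements with an ad-like class attribute."""
    for parent in list(root.iter()):
        for child in list(parent):
            if child.get("class") in classes:
                parent.remove(child)

root = ET.fromstring(HTML)
drop_tags(root, {"script", "style"})
drop_classes(root, {"ad", "banner"})

# Both the textual and the image data survive the filters.
print(ET.tostring(root, encoding="unicode"))
```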

Extracting context to improve accuracy for HTML content extraction

Special interest tracks and posters of the 14th international conference on World Wide Web - WWW '05, 2005

Web pages contain clutter (such as ads, unnecessary images, and extraneous links) around the body of an article, which distracts a user from the actual content. Extraction of "useful and relevant" content from web pages has many applications, including cell phone and PDA browsing, speech rendering for the visually impaired, reducing noise for information retrieval systems, and generally improving the web browsing experience. In our previous work [16], we developed a framework that employed an easily extensible set of techniques incorporating results from our earlier work on content extraction. Our insight was to work with DOM trees rather than raw HTML markup. We present here filters that reduce human involvement in applying heuristic settings for websites and instead automate the job by detecting and utilizing the physical layout and content genre of a given website. We also present work we have done towards improving the usability and performance of our content extraction proxy, as well as the quality and accuracy of the heuristics that act as filters for inferring the context of a webpage.
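One way to picture automating the heuristic settings is a coarse genre detector that selects a filter preset instead of requiring per-site hand tuning. The sketch below is purely illustrative: the genre names, features, thresholds, and presets are assumptions, not the authors' actual filters.

```python
# Hypothetical genre detection choosing filter presets automatically.
# Features, thresholds, and preset contents are all assumed.

def detect_genre(num_links, num_words, num_images):
    """Crude genre guess from page-level layout statistics."""
    link_ratio = num_links / max(num_words, 1)
    if num_images > 30 and num_words < 500:
        return "gallery"
    if link_ratio > 0.2:
        return "portal"       # link-heavy index page
    return "article"

FILTER_PRESETS = {
    "article": {"remove_link_lists": True,  "keep_images": True},
    "portal":  {"remove_link_lists": False, "keep_images": False},
    "gallery": {"remove_link_lists": True,  "keep_images": True},
}

settings = FILTER_PRESETS[detect_genre(num_links=40, num_words=1200, num_images=3)]
print(settings)  # -> the "article" preset, with no manual tuning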

Main Content Extraction from Heterogeneous Webpages

2018

Besides the main content, webpages often contain other complementary and noisy data such as advertisements, navigational information, copyright notices, and other template-related elements. The detection and extraction of main content can have many applications, such as web summarization, indexing, data mining, content adaptation to mobile devices, web content printing, etc. We introduce a novel site-level technique for content extraction based on the DOM representation of webpages. This technique analyzes some selected pages in any given website to identify those nodes in the DOM tree that do not belong to the webpage template. Then, an algorithm explores these nodes in order to select the main content nodes. To properly evaluate the technique, we have built a suite of benchmarks by downloading several heterogeneous real websites and manually marking the main content nodes. This suite of benchmarks can be used to evaluate and compare different content extraction techniques.
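A minimal sketch of this site-level idea: sample several pages from one site, treat the DOM nodes that recur in every sampled page as template, and keep the remainder as candidate main content. Representing a node by a (path, text) fingerprint is a simplifying assumption, not the paper's exact node-matching criterion.

```python
# Sketch of site-level template detection across sampled pages.
# A page is modeled as a list of (DOM path, text) pairs; this
# fingerprinting is an assumed simplification.

def fingerprints(page):
    """Map a page (list of (path, text) pairs) to a set of fingerprints."""
    return set(page)

def template_nodes(pages):
    """Nodes shared by every sampled page are assumed to be template."""
    common = fingerprints(pages[0])
    for page in pages[1:]:
        common &= fingerprints(page)
    return common

def main_content(page, template):
    """Keep the nodes that do not belong to the site template."""
    return [(path, text) for path, text in page if (path, text) not in template]

page_a = [("html/body/div[nav]", "Home | News"),
          ("html/body/div[main]", "Story about A"),
          ("html/body/div[foot]", "(c) 2018")]
page_b = [("html/body/div[nav]", "Home | News"),
          ("html/body/div[main]", "Story about B"),
          ("html/body/div[foot]", "(c) 2018")]

tpl = template_nodes([page_a, page_b])
print(main_content(page_a, tpl))  # -> [('html/body/div[main]', 'Story about A')]
```

The navigation bar and copyright notice repeat on both pages and are discarded as template; only the page-specific story node remains, which is the set the paper's algorithm then explores to select the main content nodes.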

Extracting Content Blocks from Web Pages

Users search for required information using search engines. Search engines crawl and index web pages according to their informative content. Users are interested only in the informative content, not in non-informative content blocks. Web pages often contain navigation sidebars, advertisements, search blocks, copyright notices, etc., which are not content blocks, and the information contained in these non-content blocks can harm web mining. It is therefore important to separate the informative primary content blocks from the non-informative blocks. In this paper, three different algorithms for separating content blocks from non-content blocks, developed by different authors, are discussed. Removing non-informative content blocks from web pages can achieve significant savings in storage and processing time.

Methods For Extracting Content Blocks From Web Pages

The Web is perhaps the single largest data source in the world, and the coverage of Web information is very wide and diverse. It contains both the information required by the user, i.e., the content blocks of a page, and irrelevant information, termed non-content information or blocks, such as banner ads, navigation bars, and copyright notices. Web mining aims to extract and mine useful knowledge from the Web, but non-content blocks harm web mining. To enhance web mining it is therefore necessary to differentiate between content and non-content blocks and to eliminate the non-content blocks from web pages. To this end, this paper deals with techniques and methods that deliver the content blocks of web pages to the user, ultimately providing significant savings in storage and processing time.

Optimized Content Extraction from web pages using Composite Approaches

The information available on the web today is tremendous and comes with great challenges. Content extraction identifies the main content and removes the clutter from web pages. The main problems in extracting content from a web page are the newer architectures of web pages and the diversity of their structure. Optimized content extraction from HTML documents using composite approaches proposes a hybrid model that operates on the Document Object Model (DOM) tree of the corresponding HTML document to extract the content accurately. It combines approaches and techniques such as statistical feature extraction and formatting characteristics. Content type identification is used along with the composite approach to overcome the problem of dealing with versatile web pages and to achieve more accuracy in extracting the content.
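A hedged sketch of such a composite score: statistical evidence (text statistics) and formatting evidence are blended, and the blend is switched by a detected content type. The feature names, weights, and the content-type switch below are assumptions for illustration only, not the paper's model.

```python
# Illustrative composite scoring of DOM blocks: statistical features
# plus formatting cues, weighted by an assumed content-type switch.

def composite_score(block, content_type):
    """Blend statistical and formatting evidence for one DOM block."""
    statistical = block["words"] / (1 + block["links"])      # text vs. links
    formatting = 1.0 if block["in_paragraph_tags"] else 0.3  # layout cue
    # Text-centric pages trust statistics; other pages lean on formatting.
    w_stat = 0.8 if content_type == "text" else 0.4
    return w_stat * statistical + (1 - w_stat) * formatting

blocks = [
    {"name": "menu",    "words": 12,  "links": 10, "in_paragraph_tags": False},
    {"name": "article", "words": 500, "links": 4,  "in_paragraph_tags": True},
]
best = max(blocks, key=lambda b: composite_score(b, "text"))
print(best["name"])  # -> article
```

Combining several weak signals this way is what lets a hybrid model cope with versatile page structures: when one cue (say, paragraph tags) is absent in a new layout, the statistical evidence can still carry the decision.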