Structured Querying of Web Text Data: A Technical Challenge

Structured querying of Web text

2007

The Web contains a huge amount of text that is currently beyond the reach of structured access tools. This unstructured data often contains a substantial amount of implicit structure, much of which can be captured using information extraction (IE) algorithms. By combining an IE system with an appropriate data model and query language, we could enable structured access to all of the Web's unstructured data. We propose a general-purpose query system called the extraction database, or ExDB, which supports SQL-like structured queries over Web text. We also describe the technical challenges involved, motivated in part by our experiences with an early 90M-page prototype.
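As a rough illustration of what "SQL-like structured queries over Web text" could look like, the sketch below loads a few extracted facts into an in-memory SQLite table and filters them by relation and confidence. It is not the ExDB design: the table layout, the `confidence` column, the sample facts, and the function names `build_extraction_db` and `query_relation` are all assumptions made for this example.

```python
import sqlite3

# Hypothetical output of an IE system: (subject, relation, object, confidence).
SAMPLE_FACTS = [
    ("Edison", "invented", "phonograph", 0.92),
    ("Bell", "invented", "telephone", 0.88),
    ("Edison", "invented", "telephone", 0.12),
    ("Seattle", "located_in", "Washington", 0.97),
]


def build_extraction_db(facts):
    """Load extracted tuples into an in-memory SQLite table (assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE facts "
        "(subject TEXT, relation TEXT, object TEXT, confidence REAL)"
    )
    conn.executemany("INSERT INTO facts VALUES (?, ?, ?, ?)", facts)
    conn.commit()
    return conn


def query_relation(conn, relation, min_confidence=0.5):
    """Return (subject, object) pairs for a relation above a confidence cutoff."""
    rows = conn.execute(
        "SELECT subject, object FROM facts "
        "WHERE relation = ? AND confidence >= ? "
        "ORDER BY confidence DESC, subject ASC",
        (relation, min_confidence),
    )
    return rows.fetchall()


if __name__ == "__main__":
    conn = build_extraction_db(SAMPLE_FACTS)
    print(query_relation(conn, "invented"))
```

The confidence cutoff stands in for the fact that extractions are uncertain; a real system would need a more careful treatment of extraction errors than a single threshold.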

WEB-SCALE INFORMATION EXTRACTION FROM UNSTRUCTURED AND UNGRAMMATICAL DATA SOURCES

TJPRC, 2014

Information Extraction (IE) is the task of automatically extracting knowledge from text. The massive body of text now available on the World Wide Web presents an unprecedented opportunity for information extraction. However, information extraction on the Web is challenging because of the enormous variety of distinct concepts and structures expressed. The explosive growth and popularity of the World Wide Web has resulted in a huge number of information sources on the Internet. However, due to the heterogeneity and lack of structure of Web information sources, access to this huge collection of information has been limited to browsing and searching. This work addresses information extraction from unstructured and ungrammatical text on the Web, such as classified ads, auction listings, and forum postings. Because the data is unstructured and ungrammatical, it precludes the use of rule-based methods that rely on consistent structures within the text and of natural language processing techniques that rely on grammar. Posts are full of useful information, as defined by the attributes that compose the entity within the post. Currently, accessing the data within posts does not go much beyond keyword search. This is precisely because the ungrammatical and unstructured nature of posts makes extraction difficult, so the attributes remain embedded within the posts. These data sources are ungrammatical, since they do not conform to the proper rules of written language. Therefore, Natural Language Processing (NLP) based information extraction techniques are not appropriate. As more and more information comes online, the ability to process and understand this information becomes more and more crucial. Data integration attacks this problem by letting users query heterogeneous data sources within a unified query framework, combining the results to ease understanding.
However, while data integration can integrate data from structured sources such as databases, semi-structured sources such as data extracted from Web pages, and even Web Services, this leaves out a large class of useful information: unstructured and ungrammatical data sources. We propose a system based on machine learning techniques to obtain structured data records from different unstructured and non-template-based websites. The proposed approach relies on a collection of known entities and their attributes, referred to as a "reference set." A reference set can be constructed from structured sources, such as databases, or scraped from semi-structured sources such as collections of Web pages. A reference set can even be constructed automatically from the unstructured, ungrammatical text itself. This project implements methods to exploit reference sets for extraction using machine learning techniques. The machine learning approach provides higher-accuracy extractions and deals with ambiguous extractions, although at the cost of requiring human effort to label training data.
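The role of a reference set can be sketched with a much simpler stand-in than the machine learning method the abstract describes: score each known entity by how many of its attribute tokens appear in a post, and keep the best match. The car records, the sample post, and the names `tokenize` and `best_reference_match` are hypothetical and chosen only for illustration.

```python
import re

# Hypothetical reference set: known entities and their attributes.
REFERENCE_SET = [
    {"make": "Honda", "model": "Civic"},
    {"make": "Honda", "model": "Accord"},
    {"make": "Toyota", "model": "Camry"},
]


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def best_reference_match(post, reference_set):
    """Return the record whose attribute tokens overlap the post the most.

    The score is the fraction of a record's tokens found in the post.
    This is plain token overlap, not a learned model.
    """
    post_tokens = tokenize(post)
    best_record, best_score = None, 0.0
    for record in reference_set:
        record_tokens = set()
        for value in record.values():
            record_tokens |= tokenize(value)
        if not record_tokens:
            continue  # skip records with no usable attribute text
        score = len(record_tokens & post_tokens) / len(record_tokens)
        if score > best_score:
            best_record, best_score = record, score
    return best_record, best_score


if __name__ == "__main__":
    post = "93 civic 4dr, runs grt, $2500 obo honda"
    print(best_reference_match(post, REFERENCE_SET))
```

Because the match does not depend on word order or grammar, it tolerates the telegraphic style of posts; it would still miss misspellings and abbreviations, which is one reason a learned matcher with labeled training data may be preferred.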

Information Extraction in Semantic, Highly-Structured, and Semi-Structured Web Sources

Polibits, 2014

The evolution of the Web from the original proposal made in 1989 can be considered one of the most revolutionary technological changes in centuries. During the past 25 years the Web has evolved from a static version to a fully dynamic and interoperable intelligent ecosystem. The amount of data produced during these few decades is enormous. New applications, developed by individual developers or small companies, can take advantage of both services and data already present on the Web. Data, produced by humans and machines, may be available in different formats and through different access interfaces. This paper analyses three different types of data available on the Web and presents mechanisms for accessing and extracting this information. The authors show several applications that leverage extracted information in two areas of research: recommendations of educational resources beyond content and interactive digital TV applications.

Scaling the Information Extraction from Unstructured and Ungrammatical Data Sources on Web

2014

Information Extraction (IE) on the Web is the task of automatically extracting knowledge from text. Web Information Extraction (WIE) systems have recently been able to extract massive quantities of relational data from online text. The massive body of text now available on the World Wide Web presents an unparalleled opportunity for information extraction. However, information extraction on the Web is challenging because of the vast variety of distinct concepts and structures expressed. The explosive growth and popularity of the World Wide Web has resulted in a huge number of information sources on the Internet. However, due to the heterogeneity, diversity, and lack of structure of Web information sources, access to this huge collection of information has been limited to browsing and searching.

Structured queries over web text

2006

The Web contains a vast amount of text that can only be queried using simple keywords-in, documents-out search queries. But Web text often contains structured elements, such as hotel location and price pairs embedded in a set of hotel reviews. Queries that process these structural text elements would be much more powerful than our current document-centric queries. Of course, text does not contain metadata or a schema, making it unclear what a structured text query means precisely. In this paper we describe three possible models for structured queries over text, each of which implies different query semantics and user interaction.

HWPDE: Novel Approach for Data Extraction from Structured Web Pages

2013

Diving into the World Wide Web to fetch precious stones (relevant information) is a tedious task under the limitations of current diving equipment (current browsers). While a lot of work is being carried out to improve the quality of diving equipment, a related area of research is to devise a novel approach for mining. This paper describes a novel approach to extracting web data from hidden websites so that it can be offered as a free service that gives users a better experience of searching for relevant data. With the proposed method, the relevant data contained in the web pages of hidden websites is extracted by a crawler and stored in a local database, building a large repository of structured, indexed, and ultimately relevant data. Such extracted data has the potential to satisfy end users starved of relevant information.

A Framework for Extracting Information from Semi-Structured Web Data Sources

2008

Nowadays, many users rely on web search engines to find and gather information. Users face an increasing number of varied semi-structured information sources, so the issue of correlating, integrating, and presenting related information to users becomes important. When a user uses a search engine such as Yahoo or Google to seek specific information, the results include not only information about the availability of the desired information, but also information about other pages on which the desired information is mentioned. The number of selected pages is enormous. Therefore, the performance capabilities of web search engines, the overlap among their results for the same queries, and their limitations are an important and large area of research. Extracting information from web data sources also becomes very important because of the massive and increasing number of diverse semi-structured information sources available to users on the Internet, and the variety of web pages makes information extraction from the web a challenging problem. This paper proposes a framework for extracting, classifying, and browsing semi-structured web data sources. The framework is able to extract relevant information from different web data sources and classify the extracted information based on the standard classification of Nokia products.

EXTRACT AND ANALYSIS OF SEMI STRUCTURED DATA FROM WEBSITES AND DOCUMENTS

Exploring the World Wide Web and portable documents to fetch useful information is a tedious task under the limitations of currently available browsers, and a huge amount of work is being carried out to improve its efficiency. Much of the information on the web and in portable digital documents is stored in backend databases that are not indexed by traditional search engines; such databases are known as semi-structured databases. Extraction and analysis of web content and documents is a time-consuming and complex task. Hence, there has been increased interest in the retrieval and integration of semi-structured web data and digital document data, with a view to improving the quality of information for users who wish to analyze the data. This paper presents an approach that identifies web page templates and the tag structures of a portable document in order to sort semi-structured data from web pages and documents, and analyzes the fetched data according to user requirements using various SQL queries. Keywords: Web page extraction, Analysis, Web Page Service, portable documents