ICDE Research Papers - Academia.edu

This paper presents an architecture overview of the distributed, heterogeneous query processor (DHQP) in the Microsoft SQL Server database system to enable queries over a large collection of diverse data sources. The paper highlights three salient aspects of the architecture. First, the system introduces well-defined abstractions such as connections, commands, and rowsets that enable sources to plug into the system. These abstractions are formalized by the OLE DB data access interfaces. The generality of OLE DB and its broad industry adoption enable our system to reach a very large collection of diverse data sources, ranging from personal productivity tools to database management systems to file system data. Second, the DHQP is built into the relational optimizer and execution engine of the system. This enables DH queries and updates to benefit from the cost-based algebraic transformations and execution strategies available in the system. Finally, the architecture is inherently ex...

Vertex is a Wrapper Induction system developed at Yahoo! for extracting structured records from template-based Web pages. To operate at Web scale, Vertex employs a host of novel algorithms for (1) Grouping similar structured pages in a Web site, (2) Picking the appropriate sample pages for wrapper inference, (3) Learning XPath-based extraction rules that are robust to variations in site structure, (4) Detecting site changes by monitoring sample pages, and (5) Optimizing editorial costs by reusing rules, etc. The system is deployed in production and currently extracts more than 250 million records from more than 200 Web sites. To the best of our knowledge, Vertex is the first system to do high-precision information extraction at Web scale.

Query answering using views amounts to computing the answer to a query having information only on the extension of a set of views. This problem is relevant in several fields, such as information integration, data warehousing, query optimization, mobile computing, and maintaining physical data independence. We address query answering using views in a context where queries and views are regular path queries, i.e., regular expressions that denote the pairs of objects in the database connected by a matching path. Regular path ...
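
As an illustration of what a regular path query denotes, the following minimal sketch evaluates one directly over a small edge-labelled graph. It does not implement query answering using views, which is the problem the abstract addresses. The database, the edge labels, and the hand-written automaton for the expression `sub* ref` are all hypothetical, and the function name `eval_rpq` is an assumption made for this example.

```python
from collections import deque


def eval_rpq(edges, dfa, start_state, accepting):
    """Return all pairs (x, y) such that some path from x to y spells a
    word accepted by the automaton (the answer to a regular path query)."""
    adjacency = {}
    nodes = set()
    for src, label, dst in edges:
        adjacency.setdefault(src, []).append((label, dst))
        nodes.update((src, dst))

    answers = set()
    for origin in nodes:
        # Breadth-first search over (graph node, automaton state) pairs.
        seen = {(origin, start_state)}
        queue = deque(seen)
        while queue:
            node, state = queue.popleft()
            if state in accepting:
                answers.add((origin, node))
            for label, target in adjacency.get(node, []):
                next_state = dfa.get((state, label))
                if next_state is not None and (target, next_state) not in seen:
                    seen.add((target, next_state))
                    queue.append((target, next_state))
    return answers


# Hypothetical toy database: edges labelled "sub" or "ref".
EDGES = [("a", "sub", "b"), ("b", "sub", "c"), ("c", "ref", "d"), ("a", "ref", "e")]
# Deterministic automaton for the regular expression  sub* ref
DFA = {(0, "sub"): 0, (0, "ref"): 1}
RESULT = eval_rpq(EDGES, DFA, 0, {1})
```

Here `RESULT` holds the pairs of objects connected by a path whose labels match `sub* ref`. The search runs on the product of the graph and the automaton, which keeps the traversal finite even when the graph has cycles.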

OLAP systems support data analysis through a multidimensional data model, according to which data facts are viewed as points in a space of application-related “dimensions”, organized into levels which conform to a hierarchy. The usual assumption is that the data points reflect the dynamic aspect of the data warehouse, while dimensions are relatively static. However, in practice, dimension updates are often necessary to adapt the multidimensional database to changing requirements. Structural updates can also take place, such as the addition of categories or modification of the hierarchical structure. When these updates are performed, the materialized aggregate views that are typically stored in OLAP systems must be efficiently maintained. These updates are poorly supported (or not supported at all) in current commercial systems, and have received little attention in the research literature. We present a formal model of dimension updates in a multidimensional model, a collection of primitive operators to perform them, and a study of the effect of these updates on a class of materialized views, giving an algorithm to efficiently maintain them.
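
To make the maintenance problem concrete, here is a minimal sketch of one kind of dimension update: moving a member to a different parent in a two-level hierarchy and patching a materialized SUM view instead of recomputing it. This is not the paper's operators or algorithm. The product-to-category hierarchy, the fact data, and the names `build_view` and `reclassify` are assumptions made for illustration.

```python
from collections import defaultdict


def build_view(facts, rollup):
    """Materialize SUM(amount) grouped by the parent level of the dimension."""
    view = defaultdict(int)
    for product, amount in facts:
        view[rollup[product]] += amount
    return dict(view)


def reclassify(view, facts, rollup, product, new_category):
    """Dimension update: move `product` under `new_category` and adjust the
    materialized view in place rather than rebuilding it from all facts."""
    old_category = rollup[product]
    if old_category == new_category:
        return
    moved_amounts = [amount for p, amount in facts if p == product]
    rollup[product] = new_category
    if not moved_amounts:
        return  # the product has no facts, so no aggregate changes
    moved = sum(moved_amounts)
    view[old_category] -= moved
    # Drop the old group if no remaining fact rolls up to it.
    if not any(rollup[p] == old_category for p, _ in facts):
        del view[old_category]
    view[new_category] = view.get(new_category, 0) + moved


# Hypothetical fact table and hierarchy.
FACTS = [("cola", 10), ("chips", 5), ("juice", 7)]
ROLLUP = {"cola": "drinks", "juice": "drinks", "chips": "snacks"}
VIEW = build_view(FACTS, ROLLUP)
reclassify(VIEW, FACTS, ROLLUP, "juice", "snacks")
```

After the update, the view matches what a full recomputation with `build_view` would produce, while only the two affected groups were touched.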

Extensible Indexing is a SQL-based framework that allows users to define domain-specific indexing schemes, and integrate them into the Oracle8i server. Users register a new indexing scheme, the set of related operators, and additional properties through SQL data ...

There are various computer architectures that will support database management applications. The distinction between the security concerns of a database management system and those of an operating system is not well defined. Both can provide the same data security to user applications. The question is how to divide the security controls between the two. This paper details the fundamental security requirements for a database management system, and the operating system security features that a database management system could take advantage of to enhance its own security. A metric for quantifying the security functions of an operating system, the Department of Defense Trusted Computer Systems Evaluation Criteria, is discussed, as is its potential application to database security assessment.

We consider an architecture of mediators and wrappers [8], for WebSources of limited capability in a wide area environment. We have developed a Web Query Optimizer (WQO) within the mediator, where the mediator has been developed as an extension of the Predator object-...

In a telecommunication network, hundreds of millions of call detail records (CDRs) are generated daily. Applications such as tandem traffic analysis require the collection and mining of CDRs on a continuous basis. The data volumes and data flow rates pose serious scalability and ...

In the ocean of Web data, Web search engines are the primary way to access content. As the data is on the order of petabytes, current search engines are very large centralized systems based on replicated clusters. Web data, however, is always evolving. The number of Web sites continues to grow rapidly and there are currently more than 20 billion indexed pages. In the near future, centralized systems are likely to become ineffective against such a load, thus suggesting the need for fully distributed search engines. Such ...