Katja Hose | Aalborg University
Papers by Katja Hose
Large-scale knowledge graphs such as those in the Linked Data cloud are typically represented as subject-predicate-object triples. However, many facts about the world involve more than two entities. While n-ary relations can be converted to triples in a number of ways, unfortunately, the structurally different choices made in different knowledge sources significantly impede our ability to connect them. They also make it impossible to query the data concisely and without prior knowledge of each individual source. We present FrameBase, a wide-coverage knowledge-base schema that uses linguistic frames to seamlessly represent and query n-ary relations from other knowledge bases, at different levels of granularity connected by logical entailment. It also opens possibilities to draw on natural language processing techniques for querying and data mining.
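To make the n-ary relation problem concrete, the following minimal sketch shows one common way of converting such a fact into subject-predicate-object triples by introducing an intermediate event node. The identifiers (`ex:marriage1`, `ex:spouse1`, and so on) and the helper `nary_to_triples` are hypothetical and are not taken from FrameBase; the sketch only illustrates the general pattern, and other sources may choose structurally different encodings.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]


def nary_to_triples(event_id: str, event_type: str, roles: Dict[str, str]) -> List[Triple]:
    """Represent an n-ary fact as triples attached to a single event node.

    All names are illustrative assumptions, not FrameBase vocabulary.
    """
    # One triple types the event node; one further triple is added per role.
    triples: List[Triple] = [(event_id, "rdf:type", event_type)]
    for role, filler in roles.items():
        triples.append((event_id, role, filler))
    return triples


# A hypothetical three-participant fact: who married whom, and when.
triples = nary_to_triples(
    "ex:marriage1",
    "ex:Marriage",
    {"ex:spouse1": "ex:Alice", "ex:spouse2": "ex:Bob", "ex:year": "1999"},
)
for triple in triples:
    print(triple)
```

A query over this encoding has to know that the participants hang off the event node, which is exactly the kind of source-specific knowledge the abstract describes as an obstacle.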
Information Systems, 2008
Peer Data Management Systems (PDMS) are a novel, useful, but challenging paradigm for distributed data management and query processing. Conventional integrated information systems have a hierarchical structure with an integration component that manages a global schema and distributes queries against this schema to the underlying data sources. PDMS extend this architecture naturally by allowing each participating system (peer) to act both as a data source and as an integrator. Peers are interconnected by schema mappings, which guide the rewriting of queries between the heterogeneous schemas, and thus form a P2P (peer-to-peer)-like network. Despite several years of research, the development of efficient PDMS still holds many challenges. In this article we first survey the state of the art on peer data management: we classify PDMS by characteristics concerning their system model, their semantics, their query planning schemes, and their maintenance. Then we systematically examine open research directions in each of those areas. In particular, we observe that research results from both the domain of P2P systems and of conventional distributed data management can have an impact on the development of PDMS.
Typical approaches for querying structured Web Data collect (crawl) and pre-process (index) large amounts of data in a central data repository before allowing for query answering. This time-consuming pre-processing phase, however, leverages the benefits of Linked Data (where structured data is accessible live and up-to-date at distributed Web resources that may change constantly) only to a limited degree, as query results can never be up-to-date. An ideal query answering system for Linked Data should return current answers in a reasonable amount of time, even on corpora as large as the Web. Query processors evaluating queries directly on the live sources require knowledge of the contents of data sources. In this paper, we develop and evaluate an approximate index structure summarising graph-structured content of sources adhering to Linked Data principles, provide an algorithm for answering conjunctive queries over Linked Data on the Web exploiting the source summary, and evaluate the system using synthetically generated queries. The experimental results show that our lightweight index structure enables complete and up-to-date query results over Linked Data, while keeping the overhead for querying low and providing a satisfying source ranking "for free".
Proceedings of the VLDB Endowment, 2008
OLAP servers based on relational backends typically exploit materialized aggregate tables to improve response times of complex analytical queries. One of the key problems in this context is the view selection problem: choosing the optimal set of aggregation tables (called configuration) for a given workload. In this paper, we present a system that continuously monitors the workload and raises a quantified alert when a better configuration is available. We address the tasks of query monitoring and view selection at the OLAP level instead of the SQL level, which simplifies the containment checks as well as rewriting and in this way helps to reduce the complexity of the backend system. At the demo we plan to show how our system works, i.e., how the system reacts to arbitrary (interactive) workloads and how the user is alerted that a better configuration is available.
Typical approaches for querying structured Web Data collect (crawl) and pre-process (index) large amounts of data in a central data repository before allowing for query answering. However, this time-consuming pre-processing phase leverages the benefits of Linked Data (where structured data is accessible live and up-to-date at distributed Web resources that may change constantly) only to a limited degree, as query results can never be current. An ideal query answering system for Linked Data should return current answers in a reasonable amount of time, even on corpora as large as the Web. Query processors evaluating queries directly on the live sources require knowledge of the contents of data sources. In this paper, we develop and evaluate an approximate index structure summarising graph-structured content of sources adhering to Linked Data principles, provide an algorithm for answering conjunctive queries over Linked Data on the Web exploiting the source summary, and evaluate the system using synthetically generated queries. The experimental results show that our lightweight index structure enables complete and up-to-date query results over Linked Data, while keeping the overhead for querying low and providing a satisfying source ranking at no additional cost.
Peer Data Management Systems (PDMS) have recently attracted attention from the database community. One of the main challenges of this paradigm is the development and evaluation of indexing and query processing strategies for large-scale networks. So far, research groups working in this area have built their own testing environments, which first requires considerable effort and second makes it difficult to compare different strategies. In this demonstration paper, we present a simulation environment that aims to be an extensible platform for experimenting with query processing techniques in PDMS and allows for running large simulation experiments in distributed environments such as workstation clusters or even PlanetLab. In the demonstration we plan to show the evaluation of processing strategies for queries with specialized operators like top-k and skyline computation on structured data.
Computer Science - Research and Development, 2009
In many application scenarios, for example in design or media production processes, several authors have to work cooperatively on the same project and consequently on the same data. In this context, a frequently used data format is XML. To enable cooperative authoring of shared XML graph structures, several requirements have to be fulfilled, e.g., early visibility of updates, multi-directional information flow, and processing data in parallel. Most transaction models proposed in the literature are hardly applicable in this context. In this paper, we propose a novel transaction model based on multi-level transactions and dynamic actions that meets these requirements. We describe the transaction model as well as its formal properties and discuss issues such as synchronization and logging.
Distributed Query Processing in P2P Systems with incomplete schema information. Marcel Karnstedt, Katja Hose, Kai-Uwe Sattler ... No mediation takes place. The Piazza system ([4]) is based on schema mappings between the participating peers. ...
Efficient query processing in P2P systems poses a variety of challenges mainly resulting from the strict decentralization and limited knowledge. Particularly with regard to queries involving ranking, top-N or skylines, existing approaches for centralized systems cannot be applied easily to P2P environments. In this paper, we focus on the problem of efficiently processing skyline queries in large-scale P2P systems, where it is nearly impossible to guarantee complete and exact query answers without exhaustive search, i.e., flooding the network. Thus, applying approximate query answering techniques, which are also typical for processing top-N queries in centralized database environments, seems to be the natural choice. We address this problem by presenting an approach that allows for reducing the number of queried peers as well as for giving probabilistic guarantees for the correctness of the answer.
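For readers unfamiliar with skyline queries, the sketch below shows the underlying dominance test in its simplest, centralized form. It is a naive quadratic computation over a local list of points and is not the distributed, approximate approach the paper describes; the function names and the hotel-style sample data (price, distance) are illustrative assumptions, with smaller values treated as better in every dimension.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """a dominates b if it is no worse in every dimension and strictly better in one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def skyline(points: List[Point]) -> List[Point]:
    """Naive O(n^2) skyline: keep every point that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (price, distance) pairs; lower is better for both.
hotels = [(50, 8), (60, 3), (80, 1), (70, 5), (90, 2)]
print(skyline(hotels))  # [(50, 8), (60, 3), (80, 1)]
```

In a P2P setting each peer holds only part of the data, so computing the exact result this way would require contacting every peer, which is the flooding the abstract seeks to avoid.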
Efficient query processing in P2P-based Web integration systems poses a variety of challenges resulting from the strict decentralization and limited knowledge. As a special problem in this context we consider the evaluation of top-N queries on structured ...
Sensor networks have evolved into a powerful infrastructure component for event monitoring in many application scenarios. In addition to simple filter and aggregation operations, an important task in processing sensor data is data mining, i.e., the identification of relevant information and patterns. Limited capabilities of sensor nodes in terms of storage and processing capacity, battery lifetime, and communication demand power-efficient, preferably sensor-local processing. In this paper, we present AnduIN, a system for developing, deploying, and running in-network data mining tasks. The system consists of a data stream processing engine, a library of operators for sensor-local processing, a box-and-arrow editor for specifying data mining tasks and deployment, a GUI providing the user with current information about the network and running queries, and an alerter notifying the user if a better query execution plan is available. At the demonstration site, we plan to show our system in action using burst detection as an example application.
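As a rough illustration of what a burst detection task over a sensor stream can look like, the sketch below flags positions where the sum of the most recent readings exceeds a threshold. This is a deliberately simple fixed-window check and should not be read as AnduIN's operator; the function name, window size, threshold, and sample readings are all assumptions made for the example.

```python
from collections import deque
from typing import Deque, Iterable, List


def detect_bursts(readings: Iterable[float], window: int, threshold: float) -> List[int]:
    """Return indices at which the sum of the last `window` readings exceeds `threshold`."""
    recent: Deque[float] = deque(maxlen=window)
    running_sum = 0.0
    bursts: List[int] = []
    for i, value in enumerate(readings):
        if len(recent) == window:
            # The oldest reading is about to be evicted by the append below.
            running_sum -= recent[0]
        recent.append(value)
        running_sum += value
        # Only report once a full window of readings has been seen.
        if len(recent) == window and running_sum > threshold:
            bursts.append(i)
    return bursts


print(detect_bursts([1, 1, 1, 5, 6, 1, 1], window=3, threshold=10))  # [4, 5]
```

Because the check needs only the last few readings and a running sum, a computation of this shape could in principle run on a sensor node itself, which is the kind of sensor-local processing the abstract argues for.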
Wireless sensor networks have become important architectures for many application scenarios, e.g., traffic monitoring or environmental monitoring in general. As these sensors are battery-powered, query processing strategies aim at minimizing energy consumption. Because sending all sensor readings to a central stream data management system consumes too much energy, parts of the query can already be processed within the network (in-network query processing). An important optimization criterion in this context is where to process which intermediate results and how to route them efficiently. To address these problems, we propose AnduIN, a system offering an optimizer that decides which parts of the query should be processed within the sensor network. It also considers optimization with respect to complex data analysis tasks, such as burst detection. Furthermore, AnduIN offers a Web-based frontend for declarative query formulation and deployment. In this paper, we present our research prototype and focus on AnduIN's components that facilitate deployment and usability.
Recently, the peer-to-peer (P2P) paradigm has emerged, mainly through file-sharing systems such as Napster and Gnutella and in terms of scalable distributed data structures. Due to their decentralization, P2P systems promise improved robustness and scalability and therefore also open a new view on data integration solutions. However, several design and technical challenges arise in building scalable P2P-based integration systems. In this paper, we address one of them: the problem of distributed query processing. We discuss strategies of query decomposition and routing based on different kinds of routing indexes and present results of an experimental evaluation.
As P2P systems are a very popular approach to connect a possibly large number of peers, efficient query processing plays an important role. Appropriate strategies have to take the characteristics of these systems into account. Due to the possibly large number of peers, extensive flooding is not possible. The application of routing indexes is a commonly used technique to avoid flooding. Promising techniques to further reduce execution costs are query operators such as top-N and skyline, constraints, and the relaxation of exactness and/or completeness. In this paper, we propose strategies that take all these aspects into account. The choice of whether and to what extent to relax exactness or apply constraints is left to the user. We provide a thorough evaluation that uses two types of distributed data summaries as examples for routing indexes.
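The idea of using a routing index to avoid flooding can be sketched as follows, under strong simplifying assumptions: each neighbor is summarized by a single value range for one attribute, and a range query is forwarded only to neighbors whose summary overlaps it. The peer names, the range-based summary, and the function `select_neighbors` are hypothetical; the data summaries evaluated in the paper are not specified in the abstract and may differ substantially.

```python
from typing import Dict, List, Tuple

Range = Tuple[float, float]


def select_neighbors(index: Dict[str, Range], query: Range) -> List[str]:
    """Return the neighbors whose summarized value range overlaps the query range."""
    lo, hi = query
    # Two closed intervals overlap when each starts before the other ends.
    return sorted(n for n, (n_lo, n_hi) in index.items() if n_lo <= hi and lo <= n_hi)


# Hypothetical routing index: neighbor -> (min, max) of values reachable via that neighbor.
routing_index = {"peerA": (0, 10), "peerB": (20, 30), "peerC": (8, 25)}
print(select_neighbors(routing_index, (9, 12)))  # ['peerA', 'peerC']
```

Here `peerB` is pruned because nothing reachable through it can match, so the query is forwarded to two neighbors instead of three.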
Peer Data Management Systems (PDMS) are currently gaining increasing attention as a way to cope with the needs of growing organizational integration. Efficient query processing, as one of the main requirements in these systems, poses three major challenges: achieving robustness, scalability and self-organization. In this paper we deal with the physical aspects of these requirements. We introduce an adaptive maintenance technique based on query feedback for keeping routing filters, which are used to optimize routing, up-to-date. These filters are applied in conjunction with an iterative query processing strategy and we show that this can improve robustness and scalability of query processing in distributed data management systems.
Evolving from heterogeneous database systems, one of the main problems in Peer Data Management Systems (PDMS) is distributed query processing. In the absence of global knowledge, query processing strategies have to focus on routing the query efficiently to only those peers that are most likely to contribute to the final result. Using routing indexes is one possibility to achieve this. Since data may change over time, these structures have to be updated and maintained, which can be very expensive. In this paper, we present a novel kind of routing index that enables efficient query routing. Furthermore, we propose a threshold-based update strategy that can help to considerably reduce maintenance costs. We exemplify the benefit of these indexes using a distributed skyline strategy. Finally, we show how relaxing exactness requirements, which are usually imposed on results, can compensate for the use of slightly outdated index information.
... In ICDCS '02, page 23, 2002. [HJKS06] Katja Hose, Andreas Job, Marcel Karnstedt, and Kai-Uwe Sattler. ... [TIM+03] I. Tatarinov, Z. Ives, J. Madhavan, A. Halevy, D. Suciu, N. Dalvi, X. Dong, Y. Kadiyska, G. Miklau, and P. Mork. The Piazza Peer Data Management Project. ...
Efficient query processing in P2P systems poses a variety of challenges. As a special problem in this context we consider the evaluation of rank-aware queries, namely top-N and skyline, on structured data. The optimization of query processing in a distributed manner at each peer requires locally available statistics. In this paper, we address this problem by presenting approaches relying on the R-tree and histogram-based index structures. We show how this allows for optimizing rank-aware queries even over multiple attributes and thus significantly enhances the efficiency of query processing.