Thomas Seidl | RWTH Aachen University

Papers by Thomas Seidl

Datenbank-Spektrum, 2013

The Chair of Computer Science 9 (Data Management and Exploration) at RWTH Aachen University works on data mining and database technologies for multimedia and spatio-temporal data in applications from engineering, the natural sciences, the life sciences, economics, and the social sciences. Both the sheer volume of data and the complexity of the individual objects pose distinct challenges for the analysis and exploration of real-world data, which we address by developing new effective and efficient concepts for data analysis and data management.

Proceedings of the Third International Workshop on Knowledge Discovery from Sensor Data - SensorKDD '09, 2009

Clustering is an established data mining technique for grouping objects based on similarity. In sensor networks, the aim is to group similar sensor measurements. As sensor networks have limited resources in terms of available memory and energy, a major task in sensor clustering is efficient computation on the sensor nodes. Communication is the dominant energy consumer and has to be reduced for better energy efficiency. With respect to memory, the amount of information stored on each sensor node has to be reduced.

Proceedings of the 19th ACM international conference on Information and knowledge management - CIKM '10, 2010

Lecture Notes in Computer Science, 2014

2011 IEEE 27th International Conference on Data Engineering, 2011

Outlier mining is an important data analysis task to distinguish exceptional outliers from regular objects. For outlier mining in the full data space, there are well-established methods which are successful in measuring the degree of deviation for outlier ranking. However, in recent applications traditional outlier mining approaches miss outliers because they are hidden in subspace projections. In particular, outlier ranking approaches that measure deviation on all available attributes miss outliers that deviate from their local neighborhood only in subsets of the attributes.

Proceedings of the 1st international workshop on Computer vision meets databases - CVDB '04, 2004

Multimedia databases keep growing, and this trend is expected to continue. Several aspects drive the demand for efficient database techniques to manage the flood of multimedia data, namely the increasing number of objects, the increasing complexity of objects, and the emergence of new query types. Whereas traditional indexing structures cope with large numbers of simple objects, complex multimedia objects require more sophisticated indexing techniques. In the tutorial, we discuss characteristics of multimedia data and multimedia queries including similarity range queries and k-nearest neighbor queries. The main focus is on efficient processing of k-nearest neighbor queries in various settings and includes direct k-NN search on indexes, multi-step k-NN query processing for complex distance functions, and methods for high-dimensional spaces.
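
The multi-step k-NN idea mentioned in this abstract can be illustrated with a minimal filter-and-refine sketch. It is not taken from the tutorial itself: the function name multistep_knn, its parameters, and the in-memory sort that stands in for an index-based ranking are all assumptions made for illustration. The sketch relies on one stated assumption, namely that the cheap filter distance never exceeds the expensive exact distance (the lower-bounding property); only then is the early stop safe.

```python
import heapq


def multistep_knn(query, objects, k, filter_dist, exact_dist):
    """Filter-and-refine k-NN sketch (hypothetical helper, not the tutorial's code).

    Assumes filter_dist(q, o) <= exact_dist(q, o) for all objects o.
    Returns up to k (exact distance, object) pairs, nearest first.
    """
    if k <= 0:
        return []
    # Filter step: rank all objects by the cheap distance. A real system
    # would obtain this ranking incrementally from an index.
    ranked = sorted((filter_dist(query, o), i) for i, o in enumerate(objects))
    best = []  # max-heap of (-exact distance, position) for the current k best
    for lower_bound, i in ranked:
        # Once k results exist and the lower bound exceeds the k-th exact
        # distance, no remaining candidate can improve the result.
        if len(best) == k and lower_bound > -best[0][0]:
            break
        d = exact_dist(query, objects[i])  # refinement step (expensive)
        if len(best) < k:
            heapq.heappush(best, (-d, i))
        elif d < -best[0][0]:
            heapq.heapreplace(best, (-d, i))
    ordered = sorted(best, key=lambda e: (-e[0], e[1]))
    return [(-neg, objects[i]) for neg, i in ordered]
```

As a usage example under the same assumptions, 2D points can use the Euclidean distance as the exact distance and the absolute difference of the first coordinates as the filter distance, since the latter never exceeds the former.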

Lecture Notes in Computer Science, 2010

Subgraph mining algorithms aim at the detection of dense clusters in a graph. In recent years many graph clustering methods have been presented. Most of the algorithms focus on undirected or unweighted graphs. In this work, we propose a novel model to determine the interesting subgraphs also for directed and weighted graphs. We use the method of density computation based on influence functions to identify dense regions in the graph. We present different types of interesting subgraphs. In experiments we show the high clustering quality of our GDens algorithm. GDens outperforms competing approaches in terms of quality and runtime.

2013 IEEE 13th International Conference on Data Mining Workshops, 2013

Mining multivariate time series data by clustering is an important research topic. Time series can be clustered by standard approaches like k-means, or by advanced methods such as subspace clustering and triclustering. A problem with these new methods is the lack of a general evaluation scheme that researchers can use to understand and compare the algorithms. Publications on new algorithms mostly use different datasets and evaluation measures in their experiments, which makes comparisons with other algorithms rather unfair. In this demonstration, we present our ongoing work on an experimental framework that offers the means for extensive visualization and evaluation of time series clustering algorithms. It includes a multitude of methods from different clustering paradigms such as full space clustering, subspace clustering, and triclustering. It provides a flexible data generator that can simulate different scenarios, especially for temporal subspace clustering. It offers external evaluation measures and visualization features that allow for effective analysis and better understanding of the obtained clusterings. Our demonstration system is available on our website.

Proceedings of the 2010 SIAM International Conference on Data Mining, 2010

Analyzing uncertain databases is a challenge in data mining research. Usually, data mining methods rely on precise values. In scenarios where uncertain values occur, e.g. due to noisy sensor readings, these algorithms cannot deliver high-quality patterns. Besides uncertainty, data mining methods face another problem: high dimensional data. For finding object groupings with locally relevant dimensions in this data, subspace clustering was introduced. For high dimensional uncertain data, however, deciding whether dimensions are relevant for a subspace cluster is even more challenging; thus, approaches for effective subspace clustering on uncertain databases are needed.

2012 IEEE 12th International Conference on Data Mining, 2012

Mining temporal multivariate data by clustering is an important research topic. In today's complex data, interesting patterns are often bound neither to the full dimensional nor to the full temporal extent of the data domain. This challenge is met by temporal subspace clustering methods. Their effectiveness, however, is impeded by aspects unavoidable in real-world data: misalignments between time series, caused for example by out-of-sync sensors, and measurement errors. Under these conditions, existing temporal subspace clustering approaches miss the patterns contained in the data.

Lecture Notes in Computer Science, 2010

Massive amounts of video data from digital TV channels, online video communities, peer-to-peer networks, and video blogs require automated techniques for copyright enforcement and usage tracking. Effective video copy distortion models usually incur high computational cost. We propose an index-supported multistep filter-and-refine algorithm for a complex copy detection model. We characterize a class of filters for which we prove completeness of the result, and provide further runtime improvement by a novel tight approximation. In ...

Lecture Notes in Computer Science, 2001

Intervals represent a fundamental data type for temporal, scientific, and spatial databases where time stamps and point data are extended to time spans and range data, respectively. For OLTP and OLAP applications on large amounts of data, not only intersection queries have to be processed efficiently but also general interval relationships including before, meets, overlaps, starts, finishes, contains, equals, during, startedBy, finishedBy, overlappedBy, metBy, and after. Our new algorithms use the Relational Interval Tree, a purely SQL-based and object-relationally wrapped index structure. The technique therefore preserves the industrial strength of the underlying RDBMS including stability, transactions, and performance. The efficiency of our approach is demonstrated by an experimental evaluation on a real weblog data set containing one million sessions.
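
To make the thirteen interval relationships named in this abstract concrete, the following minimal sketch classifies a pair of intervals by comparing endpoints. It is a plain in-memory illustration and not the Relational Interval Tree or the paper's SQL-based algorithms; the function name interval_relation and the representation of an interval as a (start, end) pair with start < end are assumptions.

```python
from typing import Tuple

Interval = Tuple[int, int]  # assumed representation: (start, end) with start < end


def interval_relation(a: Interval, b: Interval) -> str:
    """Return the name of the relationship that holds from interval a to interval b."""
    (a_lo, a_hi), (b_lo, b_hi) = a, b
    if a_lo >= a_hi or b_lo >= b_hi:
        raise ValueError("intervals must satisfy start < end")
    # Disjoint or touching at exactly one endpoint.
    if a_hi < b_lo:
        return "before"
    if a_hi == b_lo:
        return "meets"
    if b_hi < a_lo:
        return "after"
    if b_hi == a_lo:
        return "metBy"
    # From here on the two intervals share more than a single point.
    if a_lo == b_lo and a_hi == b_hi:
        return "equals"
    if a_lo == b_lo:
        return "starts" if a_hi < b_hi else "startedBy"
    if a_hi == b_hi:
        return "finishes" if a_lo > b_lo else "finishedBy"
    if b_lo < a_lo and a_hi < b_hi:
        return "during"
    if a_lo < b_lo and b_hi < a_hi:
        return "contains"
    # Only partial overlaps remain; the smaller start decides the direction.
    return "overlaps" if a_lo < b_lo else "overlappedBy"
```

Exactly one of the thirteen names is returned for every valid pair, so a query for a given relationship could, in this simplified setting, be answered by filtering on the returned name.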

Proceedings of the 12th International Conference on Extending Database Technology Advances in Database Technology - EDBT '09, 2009

In this work we concentrate on categorization of relational attributes based on their data type. Assuming that attribute type/characteristics are unknown or unidentifiable, we analyze and compare a variety of type-based signatures for classifying the attributes based on the semantic type of the data contained therein (e.g., router identifiers, social security numbers, email addresses). The signatures can subsequently be used for other applications as well, like clustering and index optimization/compression. This application is useful in cases where very large data collections that are generated in a distributed, ungoverned fashion end up having unknown, incomplete, inconsistent or very complex schemata and schema level meta-data. We concentrate on heuristically generating type-based attribute signatures based on both local and global computation approaches. We show experimentally that by decomposing data into q-grams and then considering signatures based on q-gram distributions, we achieve very good classification accuracy under the assumption that a large sample of the data is available for building the signatures. Then, we turn our attention to cases where a very small sample of the data is available, and hence accurately capturing the q-gram distribution of a given data type is almost impossible. We propose techniques based on dimensionality reduction and soft-clustering that exploit correlations between attributes to improve classification accuracy.
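
The q-gram signature idea described in this abstract can be sketched as follows. The sketch only illustrates the general principle of building a q-gram distribution per attribute and comparing distributions. The function names, the choice of cosine similarity, and the nearest-signature classification rule are assumptions and are not claimed to be the signatures or classifiers evaluated in the paper.

```python
from collections import Counter
from math import sqrt
from typing import Dict, Iterable


def qgram_signature(values: Iterable[str], q: int = 2) -> Dict[str, float]:
    """Relative frequency of every q-gram occurring in the attribute values."""
    counts: Counter = Counter()
    for value in values:
        for i in range(len(value) - q + 1):
            counts[value[i:i + q]] += 1
    total = sum(counts.values())
    return {gram: n / total for gram, n in counts.items()} if total else {}


def cosine_similarity(s: Dict[str, float], t: Dict[str, float]) -> float:
    """Cosine similarity of two sparse q-gram distributions (assumed measure)."""
    dot = sum(w * t.get(gram, 0.0) for gram, w in s.items())
    norm_s = sqrt(sum(w * w for w in s.values()))
    norm_t = sqrt(sum(w * w for w in t.values()))
    if norm_s == 0.0 or norm_t == 0.0:
        return 0.0
    return dot / (norm_s * norm_t)


def classify(sample: Iterable[str], labelled: Dict[str, Dict[str, float]], q: int = 2) -> str:
    """Assign the label whose stored signature is most similar to the sample's."""
    signature = qgram_signature(sample, q)
    return max(labelled, key=lambda label: cosine_similarity(signature, labelled[label]))
```

In use, one signature would be built per known semantic type from sample values, and an attribute of unknown type would be labelled by calling classify on a sample of its values with the same q.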

Proceedings / ... International Conference on Intelligent Systems for Molecular Biology ; ISMB. International Conference on Intelligent Systems for Molecular Biology, 1995

Protein docking is a new and challenging application for query processing in database systems. Our architecture for an efficient support of docking queries is based on the multistep query processing paradigm, a technique well-known from spatial database systems. Along with physicochemical parameters, the geometry of the molecules plays a fundamental role for docking retrieval. Thus, 3D structures and 3D surfaces of molecules are basic objects in molecular databases. We specify a molecular surface representation based on topology, define a class of neighborhood queries, and sketch some applications with respect to the docking problem. We suggest a patch-based data structure called the TriEdge structure, first, to efficiently support topological query processing, and second, to save space in comparison to common planar graph representations such as the quad-edge structure. In analogy to the quad-edge structure, the TriEdge structure has an algebraic interface and is implemented via com...

Proceedings of the VLDB Endowment, 2010

2010 IEEE International Conference on Data Mining Workshops, 2010

Large amounts of data are ubiquitous today. Data mining methods like clustering were introduced to extract knowledge from these data. Recently, detection of multiple clusterings has become an active research area, where several alternative clustering solutions are generated for a single dataset. Each of the obtained clustering solutions is valid, of importance, and provides a different interpretation of the data. The key for knowledge extraction is, however, to learn how the different solutions are related to each other. This can be achieved by a comparison and analysis of the obtained clustering solutions.

2010 IEEE International Conference on Data Mining, 2010

Today's applications deal with multiple types of information: graph data to represent the relations between objects and attribute data to characterize single objects. Analyzing both data sources simultaneously can increase the quality of mining methods. Recently, combined clustering approaches were introduced, which detect densely connected node sets within one large graph that also show high similarity according to all of their attribute values. However, for attribute data it is known that this full-space clustering often leads to poor clustering results. Thus, subspace clustering was introduced to identify locally relevant subsets of attributes for each cluster. In this work, we propose a method for finding homogeneous groups by joining the paradigms of subspace clustering and dense subgraph mining, i.e. we determine sets of nodes that show high similarity in subsets of their dimensions and that are also densely connected within the given graph. Our twofold clusters are optimized according to their density, size, and number of relevant dimensions. Our developed redundancy model confines the clustering to a manageable size of only the most interesting clusters. We introduce the algorithm Gamer for the efficient calculation of our clustering. In thorough experiments on synthetic and real world data we show that Gamer achieves low runtimes and high clustering qualities.

Proceedings of the 2004 ACM SIGMOD international conference on Management of data - SIGMOD '04, 2004

The increasing use of temporal and spatial data in present-day relational systems necessitates an efficient support of joins on interval-valued attributes. Standard join algorithms do not support those data types adequately, whereas special approaches for interval joins usually require an augmentation of the internal access methods which is not supported by existing relational systems. To overcome these problems we introduce new join algorithms for interval data. Based on the Relational Interval Tree, these algorithms can easily be implemented on top of any relational database system while providing excellent performance on joining intervals. As experimental results on an Oracle9i server show, the new techniques outperform existing relational methods for joining intervals significantly.

Technologies, Techniques and Trends, 2005

In order to generate efficient execution plans for queries comprising spatial data types and predicates, the database system has to be equipped with appropriate index structures, query processing methods, and optimization rules. Although available extensible indexing frameworks provide a gateway for seamless integration of spatial access methods into the standard process of query optimization and execution, they do not facilitate the actual implementation of the spatial access method itself. An internal enhancement of the database kernel is usually not an option for database developers. The embedding of a custom block-oriented index structure into concurrency control, recovery services and buffer management would cause extensive implementation efforts and maintenance cost, at the risk of weakening the reliability of the entire system. The server stability can be preserved by delegating index operations to an external process, but this approach induces severe performance bottlenecks due to context switches and inter-process communication. Therefore, we present the paradigm of object-relational spatial access methods, which fits the common relational data model and is highly compatible with the extensible indexing frameworks of existing object-relational database systems, allowing the user to define application-specific access methods.
