Michael Berthold - Profile on Academia.edu
Papers by Michael Berthold
IEEE Transactions on Parallel and Distributed Systems, Aug 1, 2006
In molecular biology, it is often desirable to find common properties in large numbers of drug candidates. One family of methods stems from the data mining community, where algorithms to find frequent graphs have received increasing attention over the past years. However, the computational complexity of the underlying problem and the large amount of data to be explored essentially render sequential algorithms useless. In this paper, we present a distributed approach to the frequent subgraph mining problem to discover interesting patterns in molecular compounds. This problem is characterized by a highly irregular search tree for which no reliable workload prediction is available. We describe the three main aspects of the proposed distributed algorithm, namely, a dynamic partitioning of the search space, a distribution process based on a peer-to-peer communication framework, and a novel receiver-initiated load balancing algorithm. The effectiveness of the distributed method has been evaluated on the well-known National Cancer Institute's HIV-screening data set, where we were able to show close-to-linear speedup in a network of workstations. The proposed approach also allows for dynamic resource aggregation in a non-dedicated computational environment. These features make it suitable for large-scale, multi-domain, heterogeneous environments, such as computational grids.
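A minimal sketch of the support-counting step at the heart of frequent subgraph mining, using networkx; the fragment and molecule graphs, the "element" node label, and the support threshold are illustrative assumptions, and the paper's actual contribution (search-space partitioning and distributed load balancing) is not reproduced here.

```python
# Illustrative sketch (not the paper's implementation): counting the support of a
# candidate fragment in a set of molecule graphs. subgraph_is_isomorphic() checks
# induced-subgraph containment, which stands in here for the fragment test.
import networkx as nx
from networkx.algorithms.isomorphism import GraphMatcher, categorical_node_match

def support(fragment: nx.Graph, molecules: list) -> int:
    """Number of molecules that contain `fragment` as a label-preserving subgraph."""
    node_match = categorical_node_match("element", None)   # match atoms by element label
    return sum(
        1 for mol in molecules
        if GraphMatcher(mol, fragment, node_match=node_match).subgraph_is_isomorphic()
    )

def is_frequent(fragment: nx.Graph, molecules: list, min_support: int) -> bool:
    return support(fragment, molecules) >= min_support
```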
International Journal of Computational Intelligence Systems, 2013
In this paper we describe the open source data analytics platform KNIME, focusing particularly on extensions and modules supporting fuzzy sets and fuzzy learning algorithms such as fuzzy clustering algorithms, rule induction methods, and interactive clustering tools. In addition, we outline a number of experimental extensions that are not yet part of the open source release, and present two illustrative examples from real-world applications to demonstrate the power of the KNIME extensions.
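As a rough illustration of the fuzzy learning algorithms mentioned above, here is a minimal fuzzy c-means sketch in NumPy; it is not the KNIME node implementation, and all parameter names and defaults are assumptions.

```python
# Minimal fuzzy c-means sketch: soft cluster memberships U and weighted centers.
import numpy as np

def fuzzy_c_means(X, c=3, m=2.0, iters=100, eps=1e-5, seed=None):
    """X: (n, d) data; returns (centers of shape (c, d), memberships U of shape (n, c))."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    U = rng.random((n, c))
    U /= U.sum(axis=1, keepdims=True)                  # each row is a fuzzy membership vector
    for _ in range(iters):
        W = U ** m                                     # fuzzified memberships
        centers = (W.T @ X) / W.sum(axis=0)[:, None]   # weighted cluster centers
        # squared distances to each center; small floor avoids division by zero
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2) + 1e-12
        U_new = 1.0 / (d2 ** (1.0 / (m - 1)))          # standard FCM membership update
        U_new /= U_new.sum(axis=1, keepdims=True)
        converged = np.abs(U_new - U).max() < eps
        U = U_new
        if converged:
            break
    return centers, U

# usage: centers, U = fuzzy_c_means(np.random.rand(200, 4), c=3, seed=0)
```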
The Konstanz Information Miner (KNIME) is a modular data analysis environment that allows data-flow-oriented pipelines to be assembled and executed in a simple, interactive way. As a teaching, research, and collaboration platform, KNIME provides an ideal environment for applying data transformation, visualization, and data mining nodes. Thanks to its extensible interfaces, it is easy to integrate new algorithms as well as existing tools; among others, Weka, the R Project, and the CDK (Chemistry Development Kit) are available in KNIME.

1 Overview and Introduction. Modular data analysis platforms have become increasingly popular in recent years. To make the wide range of analysis methods usable, it is important that such environments are simple and intuitive to use, allow fast and interactive changes to the analysis process, and enable the user to explore the results visually. These capabilities have led to data pipelines gaining substantially in importance. Tools of this kind allow the user to visually assemble and modify analysis workflows from standardized processing units that are connected to one another and through which data or models flow. A further advantage of such systems is the intuitive, graphical way in which the individual analysis steps can be traced. KNIME, the Konstanz Information Miner, provides such a pipeline environment. Figure 1 shows a screenshot of an analysis workflow. In the middle of the figure a flow diagram can be seen: data are read from two sources and processed in several parallel branches consisting of data preprocessing, model building, and visualization. A selection of data and model processing nodes as well as visualizations is shown on the left. These various modules for reading data and models, preprocessing, model building, data mining algorithms, and visualization can easily be dragged onto the workbench by mouse interaction, where they can be connected to other nodes.
Lecture Notes in Computer Science, 2005
Structured data represented in the form of graphs arises in several fields of science, and the growing amount of available data makes distributed graph mining techniques particularly relevant. In this paper, we present a distributed approach to the frequent subgraph mining problem to discover interesting patterns in molecular compounds. The problem is characterized by a highly irregular search tree for which no reliable workload prediction is available. We describe the three main aspects of the proposed distributed algorithm, namely a dynamic partitioning of the search space, a distribution process based on a peer-to-peer communication framework, and a novel receiver-initiated load balancing algorithm. The effectiveness of the distributed method has been evaluated on the well-known National Cancer Institute's HIV-screening dataset, where the approach attains close-to-linear speedup in a network of workstations.
Microprocessors and Microsystems, Jun 1, 2007
In this paper, we present a distributed computing framework for problems characterized by a highly irregular search tree for which no reliable workload prediction is available. The framework is based on a peer-to-peer computing environment and dynamic load balancing. The system allows for dynamic resource aggregation, does not depend on any specific meta-computing middleware, and is suitable for large-scale, multi-domain, heterogeneous environments such as computational Grids. Dynamic load balancing policies based on global statistics are known to provide optimal load balancing performance, while randomized techniques provide high scalability. The proposed method combines both advantages by adopting distributed job pools and a randomized polling technique. The framework has been successfully adopted in a parallel search algorithm for subgraph mining and evaluated on a molecular compounds dataset. The parallel application has shown good scalability and close-to-linear speedup in a distributed network of workstations.
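The combination of distributed job pools and randomized polling can be sketched as follows; this is a toy, single-process illustration with invented class and method names, not the peer-to-peer framework described in the paper.

```python
# Sketch of receiver-initiated load balancing with randomized polling: an idle
# worker picks a random peer and asks it to donate part of its local job pool.
import random
from collections import deque

class Worker:
    def __init__(self, name, peers=None):
        self.name = name
        self.peers = peers or []      # other Worker instances
        self.jobs = deque()           # local job pool (e.g. unexplored search-tree nodes)

    def donate(self):
        """Give away roughly half of the local pool to a requesting peer."""
        half = len(self.jobs) // 2
        return [self.jobs.pop() for _ in range(half)]

    def poll_random_peer(self):
        """Receiver-initiated step: only when idle, poll a randomly chosen peer for work."""
        if self.jobs or not self.peers:
            return
        victim = random.choice(self.peers)
        self.jobs.extend(victim.donate())
```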
Springer eBooks, May 10, 2008
In this paper we outline an approach for network-based information access and exploration. In con... more In this paper we outline an approach for network-based information access and exploration. In contrast to existing methods, the presented framework allows for the integration of both semantically meaningful information as well as loosely coupled information fragments from heterogeneous information repositories. The resulting Bisociative Information Networks (BisoNets) together with explorative navigation methods facilitate the discovery of links across diverse domains. In addition to such "chains of evidence", they enable the user to go back to the original information repository and investigate the origin of each link, ultimately resulting in the discovery of previously unknown connections between information entities of different domains, subsequently triggering new insights and supporting creative discoveries.
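A toy illustration of the BisoNet idea, assuming a networkx graph whose edges remember the repository they were extracted from, so that a "chain of evidence" between entities of different domains can be traced back to its sources; the nodes, edges, and source labels are invented for the example.

```python
# Illustrative only (not the BisoNet implementation): a small heterogeneous network
# whose edges carry provenance, so a cross-domain path can be traced to its sources.
import networkx as nx

G = nx.Graph()
G.add_node("aspirin", domain="chemistry")
G.add_node("COX-1", domain="biology")
G.add_node("inflammation", domain="medicine")
G.add_edge("aspirin", "COX-1", source="compound-target database")
G.add_edge("COX-1", "inflammation", source="literature co-occurrence")

# a chain of evidence between entities from different domains, with provenance
path = nx.shortest_path(G, "aspirin", "inflammation")
evidence = [(u, v, G[u][v]["source"]) for u, v in zip(path, path[1:])]
print(evidence)
```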
In December 2008, version 2.0 of the data analysis platform KNIME was released. It includes several new features, which we will describe in this paper. We also provide a short introduction to KNIME for new users.
European Society for Fuzzy Logic and Technology Conference, 2009
This paper presents an approach for visualizing high-dimensional fuzzy rules arranged in a hierarchy, together with the training patterns they cover. A standard multi-dimensional scaling method is used to map the rule centers of the top hierarchy level to one coherent picture. Rules of the underlying levels are projected relative to their parent level(s). In addition to the rules, all patterns are mapped onto the two-dimensional projection in relation to the positions of the corresponding rule centers. The visualization is further extended by showing hierarchical relationships between overlapping rules of different levels, which are generated by a hierarchical rule learner. This delivers interesting insights into the rule hierarchy and offers better explorative properties. Additionally, rules can be highlighted interactively, emphasizing the subsequent rules at all underlying levels together with the patterns they cover. We demonstrate that this technique allows investigation of interesting rules at different levels of granularity, which makes this approach applicable even for a large number of rules. The proposed technique is illustrated and discussed based on a number of hierarchical rule model visualizations generated from well-known benchmark data sets.
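A minimal sketch of the projection step, assuming scikit-learn's MDS and synthetic rule centers and patterns; the hierarchy levels, the interactive highlighting, and the exact placement scheme for patterns are not reproduced.

```python
# Sketch: MDS maps high-dimensional rule centers to 2-D; each pattern is then drawn
# relative to its covering rule, preserving its distance to the rule center but with
# an arbitrary direction (a simplification of the paper's placement scheme).
import numpy as np
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
rule_centers = rng.normal(size=(8, 20))                         # 8 rules in a 20-D feature space
patterns = rule_centers[rng.integers(0, 8, 200)] + 0.3 * rng.normal(size=(200, 20))
covering_rule = ((patterns[:, None, :] - rule_centers[None, :, :]) ** 2).sum(-1).argmin(1)

centers_2d = MDS(n_components=2, random_state=0).fit_transform(rule_centers)
dist = np.linalg.norm(patterns - rule_centers[covering_rule], axis=1)
angle = rng.uniform(0, 2 * np.pi, len(patterns))
patterns_2d = centers_2d[covering_rule] + 0.1 * dist[:, None] * np.c_[np.cos(angle), np.sin(angle)]
```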
Many fuzzy rule induction algorithms have been proposed in the past. Most of them tend to generate too many rules during the learning process. This is due to data sets obtained from real-world systems containing distorted elements or noisy data. Most approaches try to completely ignore outliers, which can be potentially harmful since such an example may describe a rare but still extremely interesting phenomenon in the data. In order to avoid this conflict, we propose to build a hierarchy of fuzzy rule systems. The goal of this model hierarchy is a set of interpretable models with only a few relevant rules on each level of the hierarchy. The resulting fuzzy model hierarchy forms a structure in which the top model covers all data explicitly and generates a significantly smaller number of rules than the original fuzzy rule learner. The models at the bottom, on the other hand, consist of only a few rules at each level and explain parts of the data with only weak relevance. We demonstrate the proposed method's usefulness on several classification benchmark data sets. The results demonstrate how the rule hierarchy allows much smaller fuzzy rule systems to be built and how the model, especially at higher levels of the hierarchy, remains interpretable.
Springer eBooks, 2008
The Konstanz Information Miner is a modular environment which enables easy visual assembly and interactive execution of a data pipeline. It is designed as a teaching, research and collaboration platform, which enables easy integration of new algorithms, data manipulation or visualization methods as new modules or nodes. In this paper we describe some of the design aspects of the underlying architecture and briefly sketch how new nodes can be incorporated.
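The modular node idea can be caricatured in a few lines of Python; this toy pipeline is emphatically not KNIME's actual (Java-based) node API, and the class and method names are invented for illustration.

```python
# Toy sketch of a node-based pipeline: each node consumes tables from its inputs and
# produces tables on its outputs; new functionality is added by implementing one more node.
from typing import List
import pandas as pd

class Node:
    def execute(self, inputs: List[pd.DataFrame]) -> List[pd.DataFrame]:
        raise NotImplementedError

class CSVReader(Node):
    def __init__(self, path: str):
        self.path = path
    def execute(self, inputs):
        return [pd.read_csv(self.path)]          # a source node ignores its inputs

class ColumnFilter(Node):
    def __init__(self, columns: List[str]):
        self.columns = columns
    def execute(self, inputs):
        return [inputs[0][self.columns]]          # keep only the selected columns

def run_pipeline(nodes: List[Node]) -> List[pd.DataFrame]:
    data: List[pd.DataFrame] = []
    for node in nodes:                            # a linear chain stands in for the workflow graph
        data = node.execute(data)
    return data
```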
Lecture Notes in Computer Science, 2003
Rule systems have failed to attract much interest in large data analysis problems because they tend to be too simplistic to be useful or consist of too many rules for human interpretation. We present a method that constructs a hierarchical rule system, with only a small number of rules at each stage of the hierarchy. Lower levels in this hierarchy focus on outliers or areas of the feature space where only weak evidence for a rule was found in the data. Rules further up, at higher levels of the hierarchy, describe increasingly general and strongly supported aspects of the data. We demonstrate the proposed method's usefulness on several classification benchmark data sets using a fuzzy rule induction process as the underlying learning algorithm. The results demonstrate how the rule hierarchy allows much smaller rule systems to be built and how the model, especially at higher levels of the hierarchy, remains interpretable. The presented method can be applied to a variety of local learning systems in a similar fashion.
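Under the assumption of a generic rule learner, the hierarchy construction could be sketched as follows; `learn_rules`, `Rule.support`, `Rule.covers`, and the support thresholds are placeholders for the underlying (e.g. fuzzy) rule induction algorithm and are not taken from the paper.

```python
# Sketch: strongly supported rules stay at the current level; examples covered only by
# weakly supported rules are passed down to train the next, more specialised level.
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

@dataclass
class Rule:
    support: int                               # number of training examples covered
    covers: Callable[[Sequence[float]], bool]  # membership test for one example

def build_hierarchy(examples: List[Tuple[Sequence[float], str]],
                    learn_rules: Callable[[list], List[Rule]],
                    min_support: int,
                    max_levels: int = 5) -> List[List[Rule]]:
    levels: List[List[Rule]] = []
    remaining = examples
    for _ in range(max_levels):
        if not remaining:
            break
        rules = learn_rules(remaining)
        strong = [r for r in rules if r.support >= min_support]
        levels.append(strong)
        # keep only examples not covered by any strongly supported rule
        remaining = [(x, y) for x, y in remaining
                     if not any(r.covers(x) for r in strong)]
        min_support = max(1, min_support // 2)   # lower levels tolerate weaker evidence
    return levels
```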
In real-world applications, sequential algorithms for data mining and data exploration are often unsuitable for datasets of enormous size, high dimensionality, and complex structure. Grid computing promises unprecedented opportunities for unlimited computing and storage resources. In this context there is a need to develop high-performance distributed data mining algorithms. However, the computational complexity of the problem and the large amount of data to be explored often make the design of large-scale applications particularly challenging. In this paper we present the first distributed formulation of a frequent subgraph mining algorithm for discriminative fragments of molecular compounds. Two distributed approaches have been developed and compared on the well-known National Cancer Institute's HIV-screening dataset. We present experimental results on a small-scale computing environment.
We hereby correct an error in Ref. [2], in which we studied the influence of various parameters that affect the generalization performance of fuzzy models constructed using the mixed fuzzy rule formation method [1]. On page 196, the last equation, which computes the normalized loss in volume V_norm_i, contains an error: the last term of the formula must be replaced by the ratio of the two distances in the corrected expression for V_norm_i, which involves d_i(x̃, R).
2006 IEEE International Conference on Fuzzy Systems, 2006
This paper presents an approach to visualizing and exploring high-dimensional rules in two-dimensional views. The proposed method uses multi-dimensional scaling to place the rule centers and subsequently extends the rules' regions to depict their overlap. This results not only in a visualization of the rules' distribution but also enables the relationship to their immediate neighbors to be judged. The proposed technique is illustrated and discussed on a number of well-known benchmark data sets.
2002 Annual Meeting of the North American Fuzzy Information Processing Society Proceedings. NAFIPS-FLINT 2002 (Cat. No. 02TH8622)
2005 IEEE International Conference on Systems, Man and Cybernetics
In this paper, we show how an existing fuzzy rule induction algorithm can incorporate missing values in the training procedure in a very natural way. The underlying algorithm generates rules that restrict the feature space only along a few important attributes. This property can be used to limit the algorithm's three major steps to the reduced feature space for each training instance, which allows features with unknown values to be ignored. Hence no replacement is necessary and the algorithm simply uses all available knowledge from each training instance. We demonstrate on data sets from the UCI repository that this method works well and generates rule sets that have comparable classification accuracy and are, at times, even smaller than the rule sets generated by the original algorithm.
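The key idea, ignoring unknown attribute values instead of imputing them, can be sketched with hypothetical names as follows; the real algorithm applies this inside its rule covering and adjustment steps for fuzzy membership functions rather than the crisp intervals shown here.

```python
# Sketch: a rule constrains only a few attributes; when an instance has a missing
# value for one of them, that attribute is skipped instead of being imputed.
import math

def covers(rule: dict, instance: dict) -> bool:
    """rule maps attribute -> (low, high); instance maps attribute -> value or NaN/None."""
    for attr, (low, high) in rule.items():
        value = instance.get(attr)
        if value is None or (isinstance(value, float) and math.isnan(value)):
            continue                      # missing value: use only the known attributes
        if not (low <= value <= high):
            return False
    return True

rule = {"age": (30, 50), "blood_pressure": (120, 140)}
print(covers(rule, {"age": 42, "blood_pressure": float("nan")}))   # True: missing value ignored
print(covers(rule, {"age": 42, "blood_pressure": 150}))            # False: known value violates rule
```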
Rule systems have failed to attract much interest in large data analysis problems because they tend to be too simplistic to be useful or consist of too many rules for human interpretation. We recently presented a method that constructs a hierarchical rule system, with only a small number of rules at each level of the hierarchy. Lower levels in this hierarchy focus on outliers or areas of the feature space where only weak evidence for a rule was found in the data. Rules further up, at higher levels of the hierarchy, describe increasingly general and strongly supported aspects of the data. In this paper we show how a connected set of parallel coordinate displays can be used to visually explore this hierarchy of rule systems, providing an intuitive mechanism to zoom in and out of the underlying model.
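One of the linked views can be approximated with pandas' parallel_coordinates; the iris data merely stands in for the patterns covered at one level of the rule hierarchy, and the coordinated zooming between levels described in the paper is not reproduced.

```python
# Sketch of a single parallel-coordinates view over the patterns at one hierarchy level.
import matplotlib.pyplot as plt
import pandas as pd
from pandas.plotting import parallel_coordinates
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
df = iris.frame.rename(columns={"target": "class"})
df["class"] = df["class"].map(dict(enumerate(iris.target_names)))

parallel_coordinates(df, class_column="class", alpha=0.4)
plt.title("Patterns covered at one level of the rule hierarchy (illustrative)")
plt.show()
```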
Many fuzzy rule induction algorithms have been proposed in the past. Most of them tend to generate too many rules during the learning process. This is due to data sets obtained from real-world systems containing distorted elements or noisy data. Most approaches try to completely ignore outliers, which can be potentially harmful since such an example may describe a rare but still extremely interesting phenomenon in the data. In order to avoid this conflict, we propose to build a hierarchy of fuzzy rule systems. The goal of this model hierarchy is a set of interpretable models with only a few relevant rules on each level of the hierarchy. The resulting fuzzy model hierarchy forms a structure in which the top model covers all data explicitly and generates a significantly smaller number of rules than the original fuzzy rule learner. The models at the bottom, on the other hand, consist of only a few rules at each level and explain parts of the data with only weak relevance. We demonstrate the proposed method's usefulness on several classification benchmark data sets. The results demonstrate how the rule hierarchy allows much smaller fuzzy rule systems to be built and how the model, especially at higher levels of the hierarchy, remains interpretable.