Aleksey Buzmakov - Academia.edu
Papers by Aleksey Buzmakov
Discrete Applied Mathematics
With the growing interest in machine-processable data and the progress of semantic technologies, many datasets are now published as RDF triples, constituting the so-called Web of Data. Data can be queried using SPARQL, but there is still a need for integrating, classifying and exploring the data for data analysis and knowledge discovery purposes. This research work proposes a new approach based on Formal Concept Analysis and Pattern Structures for building a pattern concept lattice from a set of RDF triples. This lattice can be used for data exploration and, in particular, can be visualized with an adapted tool. The specific pattern structure introduced for RDF data builds a bridge to other studies on the use of structured attribute sets when building concept lattices. Our approach is experimentally validated on the classification of RDF data, showing the efficiency of the underlying algorithms.
International Journal of General Systems, 2015
Nowadays data sets are available in very complex and heterogeneous forms. Mining such data collections is essential to support many real-world applications ranging from healthcare to marketing. In this work, we focus on the analysis of "complex" sequential data by means of interesting sequential patterns. We approach the problem using the elegant mathematical framework of Formal Concept Analysis (FCA) and its extension based on "pattern structures". Pattern structures are used for mining complex data (such as sequences or graphs) and are based on a subsumption operation, which in our case is defined with respect to the partial order on sequences. We show how pattern structures, along with projections (i.e., a data reduction of sequential structures), are able to enumerate more meaningful patterns and increase the computing efficiency of the approach. Finally, we show the applicability of the presented method for discovering and analyzing interesting patient patterns from a French healthcare data set on cancer. The quantitative and qualitative results (with annotations and analysis from a physician) are reported for this use case, which is the main motivation for this work.
Lecture Notes in Computer Science, 2015
This article presents recent advances in Formal Concept Analysis (2010-2015), especially for dealing with complex data (numbers, graphs, sequences, etc.) in domains such as databases (functional dependencies), data mining (local pattern discovery), information retrieval and information fusion. As these advances are mainly published in artificial intelligence and FCA-dedicated venues, dissemination towards the data mining and machine learning communities is worthwhile.
Lecture Notes in Computer Science, 2015
In pattern mining, the main challenge is the exponential explosion of the set of patterns. Typically, to solve this problem, a constraint for pattern selection is introduced. One of the first constraints proposed in pattern mining is the support (frequency) of a pattern in a dataset. Frequency is an anti-monotonic function, i.e., given an infrequent pattern, none of its superpatterns is frequent. However, many other constraints for pattern selection are not (anti-)monotonic, which makes it difficult to generate patterns satisfying these constraints. In this paper we introduce the notion of projection-antimonotonicity and the θ-Σοφια algorithm, which allows efficient generation of the best patterns for some nonmonotonic constraints. We consider stability and the Δ-measure, which are nonmonotonic constraints, and apply them to interval tuple datasets. In the experiments, we compute the best interval tuple patterns w.r.t. these measures and show the advantage of our approach over postfiltering approaches.
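The anti-monotonicity of support mentioned in this abstract can be illustrated with a minimal Python sketch. The toy transactions and the helper names `support` and `check_anti_monotone` are assumptions made for illustration only; they are not taken from the paper, and the sketch does not implement θ-Σοφια.

```python
from itertools import combinations

# Toy transaction database (hypothetical data, for illustration only).
TRANSACTIONS = [
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "c"},
    {"b", "c"},
    {"a"},
]


def support(itemset, transactions=TRANSACTIONS):
    """Count the transactions that contain every item of `itemset`."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= t)


def check_anti_monotone(items, transactions=TRANSACTIONS):
    """Return True if adding an item to an itemset never increases its support."""
    items = sorted(items)
    for size in range(len(items)):
        for subset in combinations(items, size):
            base = support(subset, transactions)
            for extra in items:
                if extra in subset:
                    continue
                if support(set(subset) | {extra}, transactions) > base:
                    return False
    return True
```

Because support can only shrink when a pattern is extended, a miner may stop extending a pattern as soon as it falls below the frequency threshold. Constraints such as stability do not offer this guarantee, which is the difficulty the abstract addresses.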
The popularization and quick growth of Linked Open Data (LOD) has led to challenging aspects regarding quality assessment and data exploration of the RDF triples that shape the LOD cloud. Particularly, we are interested in the completeness of the data and their potential to provide concept definitions in terms of necessary and sufficient conditions. In this work we propose a novel technique based on Formal Concept Analysis which organizes RDF data into a concept lattice. This allows data exploration as well as the discovery of implication rules, which are used to automatically detect missing information and then to complete RDF data. Moreover, this is a way of reconciling syntax and semantics in the LOD cloud. Finally, experiments on the DBpedia knowledge base show that the approach is well-founded and effective.
The popularization and quick growth of Linked Open Data (LOD) has led to challenging aspects regarding quality assessment and data exploration of the RDF triples that shape the LOD cloud. Particularly, we are interested in the completeness of data and its potential to provide concept definitions in terms of necessary and sufficient conditions. In this work we propose a novel technique based on Formal Concept Analysis which organizes RDF data into a concept lattice. This allows the discovery of implications, which are used to automatically detect missing information and then to complete RDF data.
In pattern mining, the main challenge is the exponential explosion of the set of patterns. Typically, to solve this problem, a constraint for pattern selection is introduced. One of the first constraints proposed in pattern mining is the support (frequency) of a pattern in a dataset. Frequency is an anti-monotonic function, i.e., given an infrequent pattern, none of its superpatterns is frequent. However, many other constraints for pattern selection are neither monotonic nor anti-monotonic, which makes it difficult to generate patterns satisfying these constraints. In this paper we introduce the notion of "generalized monotonicity" and the Sofia algorithm, which allow generating the best patterns in polynomial time for some nonmonotonic constraints, modulo constraint computation and pattern extension operations. In particular, this algorithm is polynomial for data on itemsets and interval tuples. In this paper we consider stability and delta-measure which are nonmonotonic constraints an...
Lecture Notes in Computer Science, 2015
Formal concept analysis (FCA) is a well-founded method for data analysis and has many applications in data mining. Pattern structures are an extension of FCA for dealing with complex data such as sequences or graphs. However, the computational complexity of computing with pattern structures is high, and projections of pattern structures were introduced to simplify computation. In this paper we introduce o-projections of pattern structures, a generalization of projections which defines a wider class of projections while preserving the properties of the original approach. Moreover, we show that o-projections form a semilattice, and we discuss the correspondence between o-projections and the representation contexts of o-projected pattern structures.
In vivo evaluation of brain white matter maturation is still a challenging task with no existing gold standards. In this article we propose an original approach to evaluate the early maturation of the white matter bundles, which is based on a comparison of infant and adult groups using the Mahalanobis distance computed from four complementary MRI parameters: quantitative qT1 and qT2 relaxation times, and longitudinal λ∥ and transverse λ⊥ diffusivities from diffusion tensor imaging. Such a multi-parametric approach is expected to describe maturational asynchrony better than conventional univariate approaches because it takes into account complementary dependencies of the parameters on different maturational processes, notably the decrease in water content and myelination. Our approach was tested on 17 healthy infants (aged 3 to 21 weeks) for 18 different bundles. It finely confirmed maturational asynchrony across the bundles: the spinothalamic tract, the optic radiations, the cortico-spinal tract and the fornix have the most advanced maturation, while the superior longitudinal and arcuate fasciculi, the anterior limb of the internal capsule and the external capsule have the most delayed maturation. Furthermore, this approach was more reliable than univariate approaches, as it revealed more maturational relationships between the bundles and did not violate a priori assumptions on the temporal order of bundle maturation. Mahalanobis distances decreased exponentially with age in all bundles, with the only difference between them explained by different onsets of maturation. Estimation of these relative delays confirmed that the most dramatic changes occur during the first postnatal year.
Keywords: Mahalanobis distance · White matter · Brain development · Bundles · Infants · T1 and T2 relaxometry · Diffusion tensor imaging (DTI)
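As a rough illustration of the distance named in this abstract, the following Python sketch computes the Mahalanobis distance of one parameter vector from a reference group, D(x) = sqrt((x - μ)ᵀ Σ⁻¹ (x - μ)), where μ and Σ are the mean and covariance of the reference group. How the paper builds its groups and estimates the covariance is not stated in the abstract, so the function names and the use of the sample covariance here are assumptions.

```python
import math


def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def covariance(rows):
    """Sample covariance matrix (divides by n - 1) of the given vectors."""
    n = len(rows)
    mu = mean_vector(rows)
    d = len(mu)
    cov = [[0.0] * d for _ in range(d)]
    for r in rows:
        for i in range(d):
            for j in range(d):
                cov[i][j] += (r[i] - mu[i]) * (r[j] - mu[j])
    return [[v / (n - 1) for v in row] for row in cov]


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def mahalanobis(x, reference_rows):
    """Mahalanobis distance of vector x from the distribution of reference_rows."""
    mu = mean_vector(reference_rows)
    cov = covariance(reference_rows)
    diff = [xi - mi for xi, mi in zip(x, mu)]
    weighted = solve(cov, diff)  # equals inverse(cov) applied to diff
    squared = sum(d * w for d, w in zip(diff, weighted))
    return math.sqrt(max(0.0, squared))
```

In the setting of the abstract, `x` would hold the four MRI parameters of one bundle and `reference_rows` the corresponding values in the reference group; unlike a univariate comparison, the covariance term accounts for correlations between the parameters.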
Journal of chemical information and modeling, Jan 14, 2015
This study introduces a novel method that automatically extracts potential structural alerts from a dataset of molecules. These triggering structures can be further used for knowledge discovery and for classification purposes. Computation of the structural alerts results from an implementation of a sophisticated workflow which integrates a graph-mining tool guided by growth-rate and stability. The growth-rate is a well-established measurement of contrast between classes. Moreover, the extracted patterns correspond to formal concepts; the most robust patterns, named the stable emerging patterns (SEPs), can then be identified thanks to their stability, a new notion originating from the domain of Formal Concept Analysis. All these elements are explained in the paper from the point of view of computation. The method was applied to a molecular dataset on mutagenicity. The experimental results demonstrate its efficiency: it automatically outputs a manageable amount...
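The growth-rate used as a contrast measure can be sketched as follows, using the usual emerging-pattern definition: the ratio of a pattern's relative support in the target class to its relative support in the other class. For brevity the sketch represents each molecule as a set of fragment labels rather than as a graph; the labels, the two toy datasets, and the function names are hypothetical and are not taken from the paper.

```python
import math


def relative_support(pattern, dataset):
    """Fraction of records in `dataset` that contain every element of `pattern`."""
    if not dataset:
        return 0.0
    pattern = set(pattern)
    return sum(1 for record in dataset if pattern <= record) / len(dataset)


def growth_rate(pattern, target, background):
    """Ratio of relative supports: target class over background class."""
    s_target = relative_support(pattern, target)
    s_background = relative_support(pattern, background)
    if s_background == 0:
        # Pattern absent from the background: infinite contrast if it occurs at all.
        return math.inf if s_target > 0 else 0.0
    return s_target / s_background


# Hypothetical toy data: each molecule reduced to a set of fragment labels.
MUTAGENIC = [{"nitro", "ring"}, {"nitro"}, {"ring"}, {"nitro", "amine"}]
NON_MUTAGENIC = [{"ring"}, {"amine"}, {"nitro", "ring"}, {"ring", "amine"}]
```

A pattern with a high growth-rate occurs much more often in one class than in the other, which is what makes it a candidate structural alert.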
Lecture Notes in Computer Science, 2014
Data mining aims at finding interesting patterns in datasets, where "interesting" means reflecting intrinsic dependencies in the domain of interest rather than just in the dataset. Concept stability is a popular relevancy measure in FCA. Experimental results of this paper show that high stability of a concept for a context derived from the general population suggests that concepts with the same intent in other samples drawn from the population also have high stability. A new estimate of stability is introduced and studied. It is experimentally shown that the introduced estimate gives a better approximation than the Monte Carlo approach introduced earlier.
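For readers unfamiliar with concept stability, here is a brute-force Python sketch under the usual definition from the FCA literature: the stability of a concept is the fraction of subsets of its extent whose common attributes are exactly its intent. The toy context and the function names are assumptions for illustration; the estimate proposed in the paper is not reproduced here.

```python
from itertools import combinations

# Hypothetical formal context: each object mapped to its set of attributes.
CONTEXT = {
    "g1": {"a", "b"},
    "g2": {"a", "b"},
    "g3": {"a", "c"},
}


def common_attributes(objects, context=CONTEXT):
    """Attributes shared by all given objects (every attribute for the empty set)."""
    result = set().union(*context.values())
    for g in objects:
        result &= context[g]
    return result


def stability(extent, context=CONTEXT):
    """Fraction of subsets of `extent` that have the same intent as `extent`."""
    extent = sorted(extent)
    intent = common_attributes(extent, context)
    hits = 0
    for size in range(len(extent) + 1):
        for subset in combinations(extent, size):
            if common_attributes(subset, context) == intent:
                hits += 1
    return hits / 2 ** len(extent)
```

The enumeration visits every subset of the extent, so its cost is exponential in the extent size; this is one reason approximations and estimates of stability are of practical interest.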
Procedia Computer Science, 2014
There are many measures of the usefulness of patterns in data mining. This paper focuses on the measures used in Formal Concept Analysis (FCA). In particular, concept stability is a popular relevancy measure in FCA. Experimental results of this paper show that high stability of a pattern in a given dataset derived from the general population suggests that the stability of that pattern is high in another dataset derived from the same population. In the second part of the paper, a new estimate of stability is introduced and studied. Its performance is evaluated experimentally, and it is shown to be more efficient.