Michel Koskas - Academia.edu
Papers by Michel Koskas
Networks are now the most popular way to describe interactions between biological objects. Studying network motifs is of particular interest in systems biology because these building blocks constitute functional units. We propose a tool to compute and statistically study the total number of occurrences of a given connected subgraph, called a topological motif, in a network. This tool relies on two very efficient algorithms to enumerate and/or count all the occurrences of a given topological motif in a given graph. Moreover, it implements approximate p-value computation in several probabilistic graph models, extending some previous statistical results. The method is available through an R package named NeMo.
We present in this paper a generic implementation of the Pruned Dynamic Programming Algorithm. We discuss the performance of this algorithm compared to that of several algorithms (PELT, CART), also programmed in C++ to allow a fair comparison. The program was written entirely with templates, thus allowing a large range of applications and a convenient way of adding extensions.
Proceedings of the Fifth Mexican International Conference in Computer Science, 2004. ENC 2004., 2004
Discovering patterns or frequent episodes in transactions is an important problem in data mining for the purpose of inferring deductive rules from them. Because of the huge size of the data to deal with, parallel algorithms have been designed to reduce both the execution time and the number of repeated passes over the database, in order to reduce I/O overheads as much as possible. In this paper, we introduce new approaches for the implementation of two basic algorithms for association rule discovery (namely Apriori and Eclat). Our approaches combine efficient data structures to encode key information (line indexes, candidates), and we show how to introduce parallelism for processing such data structures.
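As an illustration of the vertical "line index" idea behind Eclat, here is a minimal sequential sketch in Python. It is not the paper's implementation: the function name `eclat`, the use of plain Python sets as transaction-index lists, and the absence of any parallelism are all assumptions made for this example.

```python
from typing import Dict, FrozenSet, List, Set, Tuple


def eclat(transactions: List[Set[str]], min_support: int) -> Dict[FrozenSet[str], int]:
    """Return every itemset found in at least `min_support` transactions.

    Hypothetical helper written for illustration only.
    """
    # Vertical layout: each item maps to the indexes of the transactions holding it.
    tidsets: Dict[str, Set[int]] = {}
    for tid, transaction in enumerate(transactions):
        for item in transaction:
            tidsets.setdefault(item, set()).add(tid)

    frequent: Dict[FrozenSet[str], int] = {}

    def extend(prefix: FrozenSet[str], candidates: List[Tuple[str, Set[int]]]) -> None:
        # Every candidate (item, tids) is already frequent when appended to prefix.
        for i, (item, tids) in enumerate(candidates):
            itemset = prefix | {item}
            frequent[itemset] = len(tids)
            # Support of a larger itemset is the size of the tidset intersection.
            narrowed = []
            for other, other_tids in candidates[i + 1:]:
                common = tids & other_tids
                if len(common) >= min_support:
                    narrowed.append((other, common))
            if narrowed:
                extend(itemset, narrowed)

    start = sorted(
        ((item, tids) for item, tids in tidsets.items() if len(tids) >= min_support),
        key=lambda pair: pair[0],
    )
    extend(frozenset(), start)
    return frequent
```

The database is scanned only once, to build the index lists; every later support count is a set intersection, which is what makes this layout attractive when repeated passes over the data are costly.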
2006 15th IEEE International Conference on High Performance Distributed Computing, 2006
1 Université P. & M. Curie-Paris6, Théorie des nombres, Institut de Mathématiques de Jussieu, Paris, F-75005 France, 2 Université de Paris Nord LIPN, UMR CNRS 7030, 99 avenue J. Batiste Clément, 93430 Villetaneuse France 3 Ecole Supérieure des Sciences et ...
2011 23rd International Symposium on Computer Architecture and High Performance Computing, 2011
As the number of processors embedded in high performance computing platforms keeps growing, it is vital to push developers to improve the scalability of their codes in order to exploit all the resources of the platforms. This often requires new algorithms, techniques and methods for code development that add new properties to the application code: the presence
16th Symposium on Computer Architecture and High Performance Computing, 2004
The aim of the paper is to introduce techniques to optimize the parallel execution time of sorting on heterogeneous platforms (processor speeds are related by a constant factor). We develop a constant-time technique for mastering processor load balancing and execution time in a heterogeneous environment. We develop an analytical model for the parallel execution time, supported by preliminary experimental results in the case of a 2-processor system. The computation of the solution is independent of the problem size. Consequently, there is no overhead regarding the sorting problem. Keywords: in-core parallel sorting algorithms, heterogeneous computing, complexity of parallel algorithms.
Theoretical Computer Science, 2011
For discrete sets coded by the Freeman chain describing their contour, several linear algorithms have been designed for determining their shape properties. Most of them are based on the assumption that the boundary word forms a closed and non-intersecting discrete curve. In this article, we provide a linear time and space algorithm for deciding whether a path on a square lattice intersects itself, which makes it possible to determine efficiently whether a given path forms the contour of a discrete figure. This is achieved by adding a radix tree structure over a quadtree, where nodes represent grid points, enriched with neighborhood links that are essential for obtaining linearity. Due to its simplicity, this algorithm has many applications and, as an illustrative example, we use it to determine efficiently a solution to the more general problem of multiple-path intersection.
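To make the decision problem concrete, the sketch below checks a four-letter Freeman chain for self-intersection using a hash set of visited grid points. This is a deliberately simpler stand-in, not the radix-tree-over-quadtree structure of the article: hashing gives linear time only in expectation, whereas the article's structure is what provides the linear bound. The letter convention (0 = east, 1 = north, 2 = west, 3 = south), the function name and the `allow_closed` flag are assumptions made for this example.

```python
from typing import Set, Tuple

# Unit steps for the assumed Freeman code: 0 = east, 1 = north, 2 = west, 3 = south.
STEPS = {"0": (1, 0), "1": (0, 1), "2": (-1, 0), "3": (0, -1)}


def self_intersects(chain: str, allow_closed: bool = True) -> bool:
    """Return True if the lattice path coded by `chain` visits a grid point twice.

    With allow_closed=True, a path whose very last step returns to the
    starting point (a closed curve) is not counted as an intersection.
    """
    x, y = 0, 0
    visited: Set[Tuple[int, int]] = {(0, 0)}
    for index, letter in enumerate(chain):
        dx, dy = STEPS[letter]
        x, y = x + dx, y + dy
        if (x, y) in visited:
            closes_path = index == len(chain) - 1 and (x, y) == (0, 0)
            return not (allow_closed and closes_path)
        visited.add((x, y))
    return False
```

For instance, the unit square `"0123"` is accepted as a closed, non-intersecting contour, while `"001232"` is rejected because its fifth step lands on a point already visited.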
Parallel Processing Letters, 2006
Computing Research Repository - CORR, 2008
We say $x \in \{0,1,2\}^{\mathbb{N}}$ is a word with Sturmian erasures if, for any $a \in \{0,1,2\}$, the word obtained by erasing all occurrences of $a$ in $x$ is a Sturmian word. A large family of such words is given by coding trajectories of balls in the game of billiards in the cube. We prove that the monoid of morphisms mapping all words with Sturmian erasures to words with Sturmian erasures is not finitely generated.
Journal of Computational Biology, 2008
Future Generation Computer Systems, 2006
The aim of the paper is to introduce techniques for tuning sequential in-core sorting algorithms in the framework of two applications. The first application is parallel sorting when the processor speeds are not identical in the parallel system. The second application is the Zeta-Data Project (Koskas, 2003), whose aim is to develop novel algorithms for database issues. About 50% of the work done in building indexes is devoted to sorting sets of integers. We develop and compare algorithms built to sort with equal keys. The algorithms are variations of Sedgewick's 3-way Quicksort. In order to observe performance, to fully exploit the functional units of processors, and to optimize the use of the memory system, we use hardware performance counters that are available on most modern microprocessors. We also develop analytical results for one of our algorithms and compare expected results with the measurements. For the two applications, we show through fine-grained experiments on an Athlon processor (a three-way superscalar x86 processor) that L1 data cache misses are not the central problem, but that a subtle proportion of independent retired instructions is advisable to get performance for in-core sorting.
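For readers unfamiliar with 3-way Quicksort, the following sketch shows the textbook three-way partition that groups keys equal to the pivot in the middle, so that they are never sorted again. It is a plain baseline, not one of the tuned variants studied in the paper; the function name and the random pivot choice are assumptions made for this example.

```python
import random
from typing import List


def quicksort_3way(values: List[int]) -> None:
    """Sort `values` in place with a three-way (less / equal / greater) partition."""

    def sort(lo: int, hi: int) -> None:
        while lo < hi:
            pivot = values[random.randint(lo, hi)]
            lt, i, gt = lo, lo, hi
            # Invariant: values[lo:lt] < pivot, values[lt:i] == pivot,
            # values[gt + 1:hi + 1] > pivot, values[i:gt + 1] not yet examined.
            while i <= gt:
                if values[i] < pivot:
                    values[lt], values[i] = values[i], values[lt]
                    lt += 1
                    i += 1
                elif values[i] > pivot:
                    values[i], values[gt] = values[gt], values[i]
                    gt -= 1
                else:
                    i += 1
            # Keys equal to the pivot are in their final place. Recurse on the
            # smaller side and loop on the larger one to bound the stack depth.
            if lt - lo < hi - gt:
                sort(lo, lt - 1)
                lo = gt + 1
            else:
                sort(gt + 1, hi)
                hi = lt - 1

    sort(0, len(values) - 1)
```

When the input contains many repeated keys, as with the integer sets mentioned above, each partition pass removes every copy of the pivot from further work, which is the property the equal-key variants build on.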
Algorithms for Molecular Biology, 2014
Change point problems arise in many genomic analyses such as the detection of copy number variations or the detection of transcribed regions. The expanding Next Generation Sequencing technologies now make it possible to locate change points at nucleotide resolution. Because its complexity is almost linear in the sequence length when the maximal number of segments is constant, and because its performance has been acknowledged for microarrays, we propose to use the Pruned Dynamic Programming (PDP) algorithm for Seq-experiment outputs. This requires adapting the algorithm to the negative binomial distribution with which we model the data. We show that if the dispersion in the signal is known, the PDP algorithm can be used, and we provide an estimator for this dispersion. We describe a compression framework which reduces the time complexity without modifying the accuracy of the segmentation. We propose to estimate the number of segments via a penalized likelihood criterion. We illustrate the performance of the proposed methodology on RNA-Seq data, applying our approach to a real dataset and showing its good performance. Our algorithm is available as an R package on the CRAN repository.
Theoretical Computer Science, 2003
We prove that, given a double sequence $w$ over the alphabet $A$ (i.e. a mapping from $\mathbb{Z}^2$ to $A$), if there exists a pair $(n_0, m_0) \in \mathbb{Z}^2$ such that $p_w(n_0, m_0) < \frac{1}{100} n_0 m_0$, then $w$ has a periodicity vector, where $p_w$ is the complexity function in rectangles of $w$.