George Karypis | University of Minnesota
Papers by George Karypis
Understanding the role proteins play in regulating DNA replication is essential to forming a complete picture of how the genome manifests itself. In this work, we examine the feasibility of predicting the residues of a protein essential to binding by analyzing protein-DNA interactions from an information-theoretic perspective. Through the lens of mutual information, we explore which properties of protein sequence and structure are most useful in determining binding residues, with a particular focus on sequence features. We find that the quantity of information carried in most features is small with respect to DNA-contacting residues, the bulk being provided by sequence features along with a select few structural features.
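To make the mutual-information view in the abstract above concrete, here is a minimal sketch that estimates how many bits a single discrete per-residue feature carries about a binary DNA-contact label. The function name `mutual_information`, the toy residue string, and the contact labels are all hypothetical illustrations; this is not the authors' feature set or estimation procedure.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Plug-in estimate of I(X;Y) in bits from paired discrete observations."""
    n = len(xs)
    joint = Counter(zip(xs, ys))   # counts of (feature value, label) pairs
    px = Counter(xs)               # marginal counts of feature values
    py = Counter(ys)               # marginal counts of labels
    mi = 0.0
    for (x, y), c in joint.items():
        p_xy = c / n
        # p(x,y) / (p(x) p(y)) simplifies to c * n / (count_x * count_y)
        mi += p_xy * math.log2(c * n / (px[x] * py[y]))
    return mi

# Hypothetical toy data: residue type per position and a 0/1 DNA-contact label.
residues = list("KRKRAGAG")
contacts = [1, 1, 1, 1, 0, 0, 0, 0]
mi_bits = mutual_information(residues, contacts)

# A feature that is statistically independent of the label carries 0 bits.
mi_independent = mutual_information(list("ABAB"), [0, 0, 1, 1])
```

In the toy data the residue type fully determines the label, so the estimate equals the label entropy; on real data a plug-in estimate like this is biased upward for small samples, which is one reason small information values need cautious interpretation.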
Motivation: Alpha-helical transmembrane proteins mediate many key biological processes and represent 20-30% of all genes in many organisms. Due to the difficulties in experimentally determining their high-resolution 3D structure, computational methods that predict their topology (transmembrane helical segments and their orientation) are essential in advancing the understanding of membrane proteins' structures and functions. Methods: We developed ...
Journal of Bioinformatics and Computational Biology, 2010
Alpha-helical transmembrane proteins mediate many key biological processes and represent 20-30% of all genes in many organisms. Due to the difficulties in experimentally determining their high-resolution 3D structure, computational methods that predict their topology (transmembrane helical segments and their orientation) are essential in advancing the understanding of membrane proteins' structures and functions.
Proceedings Seventeenth IEEE Symposium on Reliable Distributed Systems (Cat. No.98CB36281), 1998
Proceedings 42nd IEEE Symposium on Foundations of Computer Science, 2001
The efficient execution of scientific simulations on HPC systems requires a partitioning of the underlying mesh among the processors such that the load is balanced and the inter-processor communication is minimized. Graph partitioning algorithms have been applied with much success for this purpose. However, the parallelization of multi-phase and multi-physics computations poses new challenges that require fundamental advances in graph partitioning technology. In addition, most existing graph partitioning algorithms are not suited for the newer heterogeneous high-performance computing platforms. This talk will describe research efforts in our group that are focused on developing novel multi-constraint and multi-objective graph partitioning algorithms that can support the advancing state-of-the-art in numerical simulation technologies. In addition, we will present our preliminary work on new partitioning algorithms that are well suited for heterogeneous architectures.
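The need for multi-constraint partitioning mentioned above can be illustrated with a small sketch: when each vertex carries one weight per computational phase, a partition can look balanced in total work yet be badly imbalanced within each phase. The function `imbalance_per_constraint` and the toy weights are hypothetical and only measure balance; they are not the partitioning algorithms described in the talk.

```python
def imbalance_per_constraint(weights, part, nparts):
    """Return max-part-weight / average-part-weight for each constraint.

    weights: list of per-vertex weight tuples (one entry per constraint)
    part:    list mapping each vertex to a partition id in [0, nparts)
    A value of 1.0 means perfectly balanced for that constraint.
    """
    ncon = len(weights[0])
    totals = [[0.0] * ncon for _ in range(nparts)]
    for w, p in zip(weights, part):
        for c in range(ncon):
            totals[p][c] += w[c]
    result = []
    for c in range(ncon):
        avg = sum(totals[p][c] for p in range(nparts)) / nparts
        result.append(max(totals[p][c] for p in range(nparts)) / avg)
    return result

# Hypothetical two-phase mesh: vertices 0-1 only work in phase 0,
# vertices 2-3 only work in phase 1.
weights = [(1, 0), (1, 0), (0, 1), (0, 1)]

# Total work per partition is equal (2 each), but each phase runs on one part.
by_phase = imbalance_per_constraint(weights, [0, 0, 1, 1], 2)

# Splitting every phase across both partitions balances each constraint.
interleaved = imbalance_per_constraint(weights, [0, 1, 0, 1], 2)
```

Here `by_phase` reports an imbalance of 2.0 for both constraints even though a single summed weight would look balanced, while `interleaved` reports 1.0 for both.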
Proceedings of the IEEE/ACM SC98 Conference, 1998
One ingredient which is viewed as vital to the successful conduct of many large-scale numerical simulations is the ability to dynamically repartition the underlying adaptive finite element mesh among the processors so that the computations are balanced and interprocessor communication is minimized. This requires that a sequence of partitions of the computational mesh be computed during the course of the computation in which the amount of data migration necessary to realize subsequent partitions is minimized, while all of the domains of a given partition contain a roughly equal amount of computational weight. Recently, parallel multilevel graph repartitioning techniques have been developed that can quickly compute high-quality repartitions for adaptive and dynamic meshes while minimizing the amount of data which needs to be migrated between processors. These algorithms can be categorized as either schemes which compute a new partition from scratch and then intelligently remap this partition to the original partition (hereafter referred to as scratch-remap schemes), or multilevel diffusion schemes. Scratch-remap schemes work quite well for graphs which are highly imbalanced in localized areas. On slightly to moderately imbalanced graphs and those in which imbalance occurs globally throughout the graph, however, they result in excessive vertex migration compared to multilevel diffusion algorithms. On the other hand, diffusion-based schemes work well for slightly imbalanced graphs and for those in which imbalance occurs globally throughout the graph.
However, these schemes perform poorly on graphs that are highly imbalanced in localized areas, as the propagation of diffusion over long distances results in excessive edge-cut and vertex migration. In this paper, we present two new schemes for adaptive repartitioning: Locally-Matched Multilevel Scratch-Remap (or LMSR) and Wavefront Diffusion. The LMSR scheme performs purely local coarsening and partition remapping in a multilevel context. In Wavefront Diffusion, the flow of vertices moves in a wavefront from overbalanced to underbalanced domains. We present experimental evaluations of our LMSR and Wavefront Diffusion algorithms on synthetically generated adaptive meshes as well as on some application meshes. We show that our LMSR algorithm decreases the amount of vertex migration required to balance the graph and produces repartitionings of similar quality compared to state-of-the-art scratch-remap schemes. Furthermore, we show that our LMSR algorithm is more scalable in terms of execution time compared to state-of-the-art scratch-remap schemes. We show that our Wavefront Diffusion algorithm obtains significantly lower vertex migration requirements, while maintaining similar edge-cut results compared to state-of-the-art multilevel diffusion algorithms, especially for highly imbalanced graphs. Furthermore, we compare Wavefront Diffusion with LMSR and show that the former will result in lower vertex migration requirements and the latter will result in higher quality edge-cut results. These results hold true regardless of the distance which diffusion is required to propagate in order to balance the graph. Finally, we discuss the run times of our schemes, which are both capable of repartitioning an eight million node graph in under three seconds on a 128-processor Cray T3E.
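The two quantities this abstract trades off, edge-cut and vertex migration, can be sketched on a toy graph. The helper names `edge_cut` and `migration`, the six-vertex path graph, and the partition vectors are hypothetical; the example only shows why relabeling (remapping) a from-scratch partition matters and does not implement LMSR or Wavefront Diffusion.

```python
def edge_cut(edges, part):
    """Number of edges whose endpoints lie in different partitions."""
    return sum(1 for u, v in edges if part[u] != part[v])

def migration(old_part, new_part):
    """Number of vertices that must move to realize new_part from old_part."""
    return sum(1 for a, b in zip(old_part, new_part) if a != b)

# Hypothetical path graph 0-1-2-3-4-5 with unit vertex weights.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]

# Current partition is imbalanced: 4 vertices on part 0, 2 on part 1.
old = [0, 0, 0, 0, 1, 1]

# A from-scratch partition may use arbitrary labels for the same split.
scratch = [1, 1, 1, 0, 0, 0]

# Remapping its labels to agree with the old partition keeps the same
# edge-cut but moves far fewer vertices.
remapped = [0, 0, 0, 1, 1, 1]

cut_scratch = edge_cut(edges, scratch)
cut_remapped = edge_cut(edges, remapped)
moved_scratch = migration(old, scratch)
moved_remapped = migration(old, remapped)
```

Both balanced partitions cut one edge, but the unremapped labeling moves five vertices while the remapped one moves only vertex 3.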
Proceedings of the 1999 International Conference on Parallel Processing, 1999
Proceedings 9th Heterogeneous Computing Workshop (HCW 2000) (Cat. No.PR00556), 2000
A job submitted to the grid can be executed by any of the servers; however, resource size or balance may be different across servers. One approach to resource management for this grid is to layer a global load distribution system on top of the local job management systems at each site. Unfortunately, classical load distribution policies fail on two aspects when applied to a multi-resource server grid. First, simple load indices may not recognize that a resource imbalance exists at a server. Second, classical job selection policies do not actively correct such a resource-imbalanced state. We show through simulation that new policies based on resource balancing perform consistently better than the classical load distribution strategies.
CHO cells, the workhorses of the biopharmaceutical industry, are derived from the Chinese hamster, arguably making it the most economically important industrial organism. The synergistic application of high-throughput sequencing technologies, along with the existing CHO EST collection as backbone, enabled the efficient assembly of the Chinese hamster genome. The current assembly (~2.5 Gb), constituting over two billion sequence reads, includes more than 25,000 annotated genes across a range of functional classes. This has allowed a global comparative analysis with the mouse, rat and human genomes. Furthermore, the investigation of regulatory features including promoters, CpG islands and microRNAs has opened up new avenues for manipulating individual gene expression as well as genome-level interventions. In addition, this work aims to study the genetic variation underlying economically important productivity traits in CHO cells, by a comparative genomics approach, with diploid hamster...
Methods and Algorithms, 2010
Proceedings of the 2007 SIAM International Conference on Data Mining, 2007
In this paper, we explore the discriminating subsequence-based clustering problem. First, several effective optimization techniques are proposed to accelerate the sequence mining process and a new algorithm, CONTOUR, is developed to efficiently and directly mine a subset of discriminating frequent subsequences which can be used to cluster the input sequences. Second, an accurate hierarchical clustering algorithm, SSC, is constructed based on the result of CONTOUR. The performance study evaluates the efficiency and scalability of CONTOUR, and the clustering quality of SSC.