George Karypis | University of Minnesota

Papers by George Karypis

An analysis of information content present in protein-DNA interactions

Understanding the role proteins play in regulating DNA replication is essential to forming a complete picture of how the genome manifests itself. In this work, we examine the feasibility of predicting the residues of a protein essential to binding by analyzing protein-DNA interactions from an information-theoretic perspective. Through the lens of mutual information, we explore which properties of protein sequence and structure are most useful in determining binding residues, with a particular focus on sequence features. We find that the quantity of information carried in most features is small with respect to DNA-contacting residues, the bulk being provided by sequence features along with a select few structural features.
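As a rough illustration of the quantity the abstract refers to, the sketch below estimates the mutual information between one discrete per-residue feature and a binary DNA-contact label. It is not the authors' pipeline: the function name `mutual_information`, the plug-in (frequency-count) estimator, and the toy data are all assumptions made for this example.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Plug-in estimate of I(X; Y) in bits from paired discrete observations."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("xs and ys must be non-empty and of equal length")
    n = len(xs)
    joint = Counter(zip(xs, ys))  # counts of each (x, y) pair
    px = Counter(xs)              # marginal counts of x
    py = Counter(ys)              # marginal counts of y
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x, y) * log2( p(x, y) / (p(x) * p(y)) ), written with raw counts
        mi += (c / n) * math.log2(c * n / (px[x] * py[y]))
    return mi

# Hypothetical toy data: a residue-type label per position and a 0/1 contact flag.
residues = ["K", "R", "A", "L", "K", "R", "A", "L"]
contacts = [1, 1, 0, 0, 1, 1, 0, 0]
print(mutual_information(residues, contacts))
```

In this toy case the feature determines the label exactly, so the estimate equals the full entropy of the label. A feature that is independent of the label would give a value near zero, which is the sense in which a feature "carries little information".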

TOPTMH: Topology Predictor for Transmembrane Helices

Motivation: Alpha-helical transmembrane proteins mediate many key biological processes and represent 20-30% of all genes in many organisms. Due to the difficulties in experimentally determining their high-resolution 3D structure, computational methods that predict their topology (transmembrane helical segments and their orientation) are essential in advancing the understanding of membrane proteins' structures and functions. Methods: We developed

TOPTMH: TOPOLOGY PREDICTOR FOR TRANSMEMBRANE α-HELICES

Journal of Bioinformatics and Computational Biology, 2010

Alpha-helical transmembrane proteins mediate many key biological processes and represent 20-30% of all genes in many organisms. Due to the difficulties in experimentally determining their high-resolution 3D structure, computational methods that predict their topology (transmembrane helical segments and their orientation) are essential in advancing the understanding of membrane proteins' structures and functions.

Load balancing of dynamic and adaptive mesh-based computations

Proceedings Seventeenth IEEE Symposium on Reliable Distributed Systems (Cat. No.98CB36281), 1998

Graph partitioning for dynamic, adaptive and multi-phase scientific simulations

Proceedings 42nd IEEE Symposium on Foundations of Computer Science, 2001

The efficient execution of scientific simulations on HPC systems requires a partitioning of the underlying mesh among the processors such that the load is balanced and the inter-processor communication is minimized. Graph partitioning algorithms have been applied with much success for this purpose. However, the parallelization of multi-phase and multi-physics computations poses new challenges that require fundamental advances in graph partitioning technology. In addition, most existing graph partitioning algorithms are not suited for the newer heterogeneous high-performance computing platforms. This talk will describe research efforts in our group that are focused on developing novel multi-constraint and multi-objective graph partitioning algorithms that can support the advancing state-of-the-art in numerical simulation technologies. In addition, we will present our preliminary work on new partitioning algorithms that are well suited for heterogeneous architectures.

Dynamic Repartitioning of Adaptively Refined Meshes

Proceedings of the IEEE/ACM SC98 Conference, 1998

One ingredient which is viewed as vital to the successful conduct of many large-scale numerical simulations is the ability to dynamically repartition the underlying adaptive finite element mesh among the processors so that the computations are balanced and interprocessor communication is minimized. This requires that a sequence of partitions of the computational mesh be computed during the course of the computation in which the amount of data migration necessary to realize subsequent partitions is minimized, while all of the domains of a given partition contain a roughly equal amount of computational weight. Recently, parallel multilevel graph repartitioning techniques have been developed that can quickly compute high-quality repartitions for adaptive and dynamic meshes while minimizing the amount of data which needs to be migrated between processors. These algorithms can be categorized as either schemes which compute a new partition from scratch and then intelligently remap this partition to the original partition (hereafter referred to as scratch-remap schemes), or multilevel diffusion schemes. Scratch-remap schemes work quite well for graphs which are highly imbalanced in localized areas. On slightly to moderately imbalanced graphs and those in which imbalance occurs globally throughout the graph, however, they result in excessive vertex migration compared to multilevel diffusion algorithms. On the other hand, diffusion-based schemes work well for slightly imbalanced graphs and for those in which imbalance occurs globally throughout the graph. However, these schemes perform poorly on graphs that are highly imbalanced in localized areas, as the propagation of diffusion over long distances results in excessive edge-cut and vertex migration. In this paper, we present two new schemes for adaptive repartitioning: Locally-Matched Multilevel Scratch-Remap (or LMSR) and Wavefront Diffusion. The LMSR scheme performs purely local coarsening and partition remapping in a multilevel context. In Wavefront Diffusion, the flow of vertices moves in a wavefront from overbalanced to underbalanced domains. We present experimental evaluations of our LMSR and Wavefront Diffusion algorithms on synthetically generated adaptive meshes as well as on some application meshes. We show that our LMSR algorithm decreases the amount of vertex migration required to balance the graph and produces repartitionings of similar quality compared to state-of-the-art scratch-remap schemes. Furthermore, we show that our LMSR algorithm is more scalable in terms of execution time compared to state-of-the-art scratch-remap schemes. We show that our Wavefront Diffusion algorithm obtains significantly lower vertex migration requirements, while maintaining similar edge-cut results compared to state-of-the-art multilevel diffusion algorithms, especially for highly imbalanced graphs. Furthermore, we compare Wavefront Diffusion with LMSR and show that the former will result in lower vertex migration requirements and the latter will result in higher quality edge-cut results. These results hold true regardless of the distance which diffusion is required to propagate in order to balance the graph. Finally, we discuss the run times of our schemes, which are both capable of repartitioning an eight million node graph in under three seconds on a 128-processor Cray T3E.
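The three quantities this abstract trades off (edge-cut, load balance, and vertex migration) can be made concrete with a small sketch. This is only an illustration of the metrics, not of LMSR or Wavefront Diffusion; the function names, the max-over-average definition of imbalance, and the six-vertex toy mesh are assumptions chosen for the example.

```python
def edge_cut(edges, part):
    """Number of edges whose endpoints lie in different domains."""
    return sum(1 for u, v in edges if part[u] != part[v])

def imbalance(weights, part, nparts):
    """Heaviest domain weight divided by the average domain weight (1.0 is perfect)."""
    load = [0.0] * nparts
    for v, w in weights.items():
        load[part[v]] += w
    return max(load) / (sum(load) / nparts)

def migration(weights, old_part, new_part):
    """Total weight of the vertices that change domain between two partitions."""
    return sum(w for v, w in weights.items() if old_part[v] != new_part[v])

# Hypothetical toy mesh: a path 0-1-2-3-4-5 where refinement doubled the
# weight of vertices 4 and 5, leaving the old two-way partition imbalanced.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
weights = {0: 1, 1: 1, 2: 1, 3: 1, 4: 2, 5: 2}
old_part = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
new_part = {0: 0, 1: 0, 2: 0, 3: 0, 4: 1, 5: 1}

print(imbalance(weights, old_part, 2), imbalance(weights, new_part, 2))
print(edge_cut(edges, old_part), edge_cut(edges, new_part))
print(migration(weights, old_part, new_part))
```

Here moving a single vertex restores balance without increasing the edge-cut, which is the kind of low-migration repartition a diffusion-style scheme aims for. A partition computed from scratch could reach the same balance while relabeling most vertices, which is why scratch-remap schemes need the remapping step.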

Multi-capacity bin packing algorithms with applications to job scheduling under multiple constraints

Proceedings of the 1999 International Conference on Parallel Processing, 1999

Load balancing across near-homogeneous multi-resource servers

Proceedings 9th Heterogeneous Computing Workshop (HCW 2000) (Cat. No.PR00556), 2000

A job submitted to the grid can be executed by any of the servers; however, resource size or balance may be different across servers. One approach to resource management for this grid is to layer a global load distribution system on top of the local job management systems at each site. Unfortunately, classical load distribution policies fail on two aspects when applied to a multi-resource server grid. First, simple load indices may not recognize that a resource imbalance exists at a server. Second, classical job selection policies do not actively correct such a resource-imbalanced state. We show through simulation that new policies based on resource balancing perform consistently better than the classical load distribution strategies.

The Genome Sequence of the Chinese Hamster: Ushering In An Era of CHO Genome Engineering

CHO cells, the workhorses of the biopharmaceutical industry, are derived from the Chinese hamster, arguably making it the most economically important industrial organism. The synergistic application of high-throughput sequencing technologies, along with the existing CHO EST collection as backbone, enabled the efficient assembly of the Chinese hamster genome. The current assembly (~2.5Gb), constituting over two billion sequence reads, includes more than 25,000 annotated genes across a range of functional classes. This has allowed a global comparative analysis with the mouse, rat and human genomes. Furthermore, the investigation of regulatory features including promoters, CpG Islands and microRNAs has opened up new avenues for manipulating individual gene expression as well as genome-level interventions. In addition, this work aims to study the genetic variation underlying economically important productivity traits in CHO cells, by a comparative genomics approach, with diploid hamster...

Ligand-Binding Residue Prediction

Methods and Algorithms, 2010

Discriminating Subsequence Discovery for Sequence Clustering

Proceedings of the 2007 SIAM International Conference on Data Mining, 2007

In this paper, we explore the discriminating subsequence-based clustering problem. First, several effective optimization techniques are proposed to accelerate the sequence mining process and a new algorithm, CONTOUR, is developed to efficiently and directly mine a subset of discriminating frequent subsequences which can be used to cluster the input sequences. Second, an accurate hierarchical clustering algorithm, SSC, is constructed based on the result of CONTOUR. The performance study evaluates the efficiency and scalability of CONTOUR, and the clustering quality of SSC.

Search eLibrary

ICDM 2013 Program Co-Chairs

Program Co-Chairs

8.7 Parallel Algorithms in Data Mining

TR O0-O14

TR O0-O57

TR O1-O20

TR O2-O16
