Anirban Mukhopadhyay | University of Kalyani

Papers by Anirban Mukhopadhyay

PloS one, 2015

Microarray and beadchip are two of the most efficient techniques for measuring gene expression and methylation data in bioinformatics. Biclustering deals with the simultaneous clustering of genes and samples. In this article, we propose a computational rule mining framework, StatBicRM (i.e., statistical biclustering-based rule mining), to identify a special type of rules and potential biomarkers from biological datasets using an integrated approach of statistical and binary inclusion-maximal biclustering techniques. At first, a novel statistical strategy is used to eliminate the insignificant, low-significant, and redundant genes in such a way that the significance level satisfies the data distribution property (viz., either normal or non-normal distribution). The data is then discretized and post-discretized, consecutively. Thereafter, the biclustering technique is applied to identify maximal frequent closed homogeneous itemsets. The corresponding special type of rules is then extracted from the selected itemsets. Our proposed rule mining method performs better than the other rule mining algorithms as it generates maximal frequent closed homogeneous itemsets instead of frequent itemsets. Thus, it saves elapsed time and can work on big datasets. Pathway and Gene Ontology analyses are conducted on the genes of the evolved rules using the DAVID database. Frequency analysis of the genes appearing in the evolved rules is performed to determine potential biomarkers. Furthermore, we classify the data to assess how accurately the evolved rules describe the remaining test (unknown) data. Subsequently, we compare the average classification accuracy and other related factors with those of other rule-based classifiers. Statistical significance tests are also performed to verify the statistical relevance of the comparative results. Here, each of the other rule mining methods or rule-based classifiers also starts with the same post-discretized data matrix. Finally, we include an integrated analysis of gene expression and methylation to determine the epigenetic effect (viz., the effect of methylation) on gene expression level.

2015 Eighth International Conference on Advances in Pattern Recognition (ICAPR), 2015

Normally, statistical methods are used to generate rankings for genes in terms of their ability to distinguish between normal and malignant tumors from a gene expression dataset. However, different statistical methods yield different ranks for the same gene, and there is no universally accepted method for ranking. Therefore, rank aggregation is required to find the overall ranking of the set of genes. There are various rank aggregation methods in the existing literature to integrate the rankings produced by various statistical tests. Moreover, the problem of integrating partial rankings that contain unequal numbers of genes is more challenging. In this article, a multiobjective genetic algorithm based rank aggregation method is proposed to integrate partial rankings in an unbiased way. The first objective is to minimize the total distance from the reference ranking to the input rankings. For distance calculation, the Scaled Footrule Distance is used. The second objective is to minimize the standard deviation among those distances in order to avoid bias toward a particular input ranking. The proposed method is applied to some real-life microarray gene expression datasets, and its performance is compared with that of several existing rank aggregation techniques with respect to accuracy and the AUC (area under the ROC curve) value. For the real-life datasets, accuracy is also plotted for visual comparison.
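The two objectives described in this abstract can be illustrated with a minimal Python sketch. It assumes one common definition of the scaled footrule distance (positions normalized by list length, summed over the genes of the partial list) and uses the population standard deviation; the function names `scaled_footrule` and `objectives` are hypothetical and are not taken from the paper.

```python
from statistics import pstdev


def scaled_footrule(reference, partial):
    """Scaled footrule distance between a full reference ranking and a partial one.

    Both arguments are lists of gene names ordered best-first. Every gene in
    `partial` is assumed to appear in `reference` (an assumption of this sketch).
    """
    ref_pos = {gene: i + 1 for i, gene in enumerate(reference)}
    n_ref, n_par = len(reference), len(partial)
    # Compare each gene's normalized position in the two lists.
    return sum(
        abs(ref_pos[gene] / n_ref - (j + 1) / n_par)
        for j, gene in enumerate(partial)
    )


def objectives(reference, input_rankings):
    """Return (total distance, standard deviation of distances), both to be minimized."""
    distances = [scaled_footrule(reference, r) for r in input_rankings]
    return sum(distances), pstdev(distances)
```

A candidate reference ranking with a low total distance but a high standard deviation sits close to some input rankings and far from others, which is the bias the second objective is meant to penalize.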

Microarray experiments generate a large amount of data which is used to discover the genetic background of diseases and to learn the characteristics of genes. Clustering the tissue samples according to their co-expressed behavior and characteristics is an important tool for partitioning the dataset. Finding the clusters of a given dataset is a difficult task. This task is even more difficult when we also try to rank each gene according to its ability to distinguish different classes of samples, which is known as gene ranking. In the literature, many algorithms are available for sample clustering and for gene ranking or selection, separately. A few algorithms are also available for simultaneous clustering and feature selection. In this article, we propose a new approach for clustering the samples and ranking the genes simultaneously. A novel encoding technique for the chromosomes is proposed for this purpose, and the work is accomplished using a multi-objective evolutionary technique. Results are demonstrated for both artificial and real-life gene expression data sets.

In this article, we have considered the problem of fuzzy clustering of categorical data. In this regard, the well-known fuzzy C-medoids algorithm for categorical data clustering is posed as a multiobjective optimization problem where the cluster medoids are encoded in the chromosomes of a multiobjective genetic algorithm. The chromosomes are of variable lengths to permit automatic evolution of the number of clusters. The chromosomes are updated through the medoid updating process of fuzzy C-medoids clustering. The fuzzy cluster variance and cluster separation are taken as the two objectives to be optimized simultaneously. The performance of the proposed algorithm has been compared with that of different well-known categorical data clustering algorithms and demonstrated for a variety of synthetic and real-life categorical data sets.
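As a rough illustration of the fuzzy C-medoids step that this abstract builds on, the sketch below computes fuzzy memberships of one categorical record with respect to a set of medoids. It assumes the Hamming distance as the dissimilarity and the standard inverse-distance membership rule with fuzzifier `m`; the abstract does not state these details, and the names `hamming` and `memberships` are hypothetical.

```python
def hamming(a, b):
    """Number of attribute positions at which two categorical records differ."""
    return sum(x != y for x, y in zip(a, b))


def memberships(point, medoids, m=2.0):
    """Fuzzy membership of `point` in each cluster, given the cluster medoids.

    Uses the usual inverse-distance rule with fuzzifier m > 1. If the point
    coincides with one or more medoids, membership is shared among those only.
    """
    dists = [hamming(point, v) for v in medoids]
    zeros = [i for i, d in enumerate(dists) if d == 0]
    if zeros:
        return [1.0 / len(zeros) if i in zeros else 0.0
                for i in range(len(medoids))]
    inv = [(1.0 / d) ** (1.0 / (m - 1.0)) for d in dists]
    total = sum(inv)
    return [w / total for w in inv]
```

The memberships for a record always sum to one, so a record that differs from one medoid in a single attribute and from another in three attributes is assigned mostly, but not entirely, to the first cluster.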

Detection of protein complexes within protein-protein interaction networks (PPIN) is a valuable step toward the analysis of biological processes and pathways. Several high-throughput experimental techniques produce a large number of PPIs that can be extensively used for constructing the PPI network of a species. Decomposition of the whole PPI network into smaller, manageable modules is an ongoing challenge. Here we have developed a multi-objective algorithm for detecting human protein complexes by partitioning the large human PPI network into clusters which serve as protein complexes. Some graphical properties, such as density and centrality, are used for building the objectives. Besides the graphical properties, we have also exploited a fuzzy measure based semantic similarity approach to construct a similarity-based objective. The proposed technique is demonstrated on the human PPI network, and the resulting complexes are analyzed in the context of Gene Ontology (GO) and pathway enrichment. We have also compared our results with those of some state-of-the-art algorithms in terms of different performance metrics. The biological relevance of our predicted complexes is also established here by linking them with 22 key disease classes.

Ranking of association rules is currently an interesting topic in data mining and bioinformatics. The huge number of rules of items (or genes) evolved by association rule mining (ARM) algorithms confuses the decision maker. In this article, we propose a weighted rule-mining technique (RANWAR, or rank-based weighted association rule mining) that ranks the rules using two novel rule-interestingness measures, viz., rank-based weighted condensed support (wcs) and weighted condensed confidence (wcc), to bypass the problem. These measures depend on the rank of the items (genes). Using the rank, we assign a weight to each item. RANWAR generates far fewer frequent itemsets than the state-of-the-art association rule mining algorithms and thus saves execution time. We run RANWAR on gene expression and methylation datasets. The genes of the top rules are biologically validated by Gene Ontology (GO) and KEGG pathway analyses. Many top-ranked rules extracted by RANWAR that hold poor ranks in traditional Apriori are highly biologically significant to the related diseases. Finally, the top rules evolved by RANWAR that are not found by Apriori are reported.

In this article, a fuzzy rule-based classifier has been designed on the framework of multiobjective Particle Swarm Optimization (PSO). The proposed approach is applied to microarray gene expression data to obtain genes with significant expression with respect to two different classes. Two fuzzy sets are represented with the linguistic values “high” and “low”. The proposed approach is applied to the training dataset to derive good classification rules. To be precise, good rules are those that have fewer attributes in the antecedent part and provide maximum accuracy. Moreover, we also consider the redundancy among the selected rules, which should be minimized. Here the underlying structure is modeled using multiobjective PSO with the support of non-dominated sorting and crowding distance sorting. The first objective is to maximize the classification accuracy, and the second objective is to minimize the rule-base complexity (number of rules and average rule length) and the redundancy of the rules. The performance of the proposed algorithm is compared with that of single-objective versions, the Support Vector Machine classifier, and the Bayes classifier on several real-life datasets.
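The non-dominated sorting mentioned in this abstract rests on Pareto dominance between candidate rule bases. The following minimal sketch assumes each candidate is summarized as an `(accuracy, complexity)` pair, with accuracy maximized and complexity minimized; this pair representation and the names `dominates` and `non_dominated` are illustrative assumptions, and crowding distance sorting is not shown.

```python
def dominates(a, b):
    """True if candidate a Pareto-dominates candidate b.

    Each candidate is an (accuracy, complexity) pair: accuracy is maximized
    and complexity is minimized.
    """
    no_worse = a[0] >= b[0] and a[1] <= b[1]
    strictly_better = a[0] > b[0] or a[1] < b[1]
    return no_worse and strictly_better


def non_dominated(solutions):
    """Return the candidates that no other candidate dominates (the first front)."""
    return [s for s in solutions
            if not any(dominates(t, s) for t in solutions)]
```

Candidates on the first front represent different trade-offs between accuracy and complexity, and none of them can be improved in one objective without worsening the other.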

In this article, an improved feature selection technique has been proposed. Mutual information is taken as the basic criterion to find feature relevance and redundancy. The mutual information between a feature and the class labels defines the relevance of that feature. Similarly, the mutual information among different features defines the correlation, i.e., the redundancy, among those features. Our objective is to find a feature set for which the mutual information between the features and the class labels is maximized and the mutual information among the features is minimized. Therefore, the goal of the proposed method is to find the most relevant and least redundant feature set. The number of output features is provided by the user. First, the most relevant feature is added to the empty final feature set. Then, in each iteration, a non-dominated feature set with respect to relevance and redundancy is generated, and from this set the most relevant and non-redundant feature is included in the final feature set. In this incremental way, one feature is added in every iteration, and this step is repeated until the size of the final feature set equals the user-given number of features. The features contained in the final feature set have maximum relevance and least correlation. The proposed method is applied to microarray gene expression data to find the most relevant and non-redundant genes, and its performance is compared with that of the popular mRMR (MIQ) and mRMR (MID) schemes on several real-life data sets.
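The incremental procedure described in this abstract can be sketched in Python for discrete features. Several details are assumptions of this sketch and are not stated in the abstract: redundancy is taken as the mean mutual information with the already selected features, the feature picked from the non-dominated set is the one with the largest relevance minus redundancy, and `k` does not exceed the number of features. The names `mutual_information` and `select_features` are hypothetical.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Mutual information (in bits) between two equal-length discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())


def select_features(columns, labels, k):
    """Pick k feature indices, trading off relevance against redundancy."""
    relevance = [mutual_information(col, labels) for col in columns]
    # Start from the single most relevant feature.
    selected = [max(range(len(columns)), key=lambda i: relevance[i])]
    while len(selected) < k:
        candidates = []
        for i in range(len(columns)):
            if i in selected:
                continue
            redundancy = sum(mutual_information(columns[i], columns[j])
                             for j in selected) / len(selected)
            candidates.append((i, relevance[i], redundancy))
        # Keep candidates not dominated in (max relevance, min redundancy).
        front = [c for c in candidates if not any(
            d[1] >= c[1] and d[2] <= c[2] and (d[1] > c[1] or d[2] < c[2])
            for d in candidates)]
        # Assumed tie-break: largest relevance minus redundancy.
        selected.append(max(front, key=lambda c: c[1] - c[2])[0])
    return selected
```

With a duplicated feature in the input, the copy is as relevant as the original but fully redundant with it, so a less correlated feature of equal relevance is preferred in the second iteration.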

Finding the most significant gene from microarray time series data is important for designing drugs for a particular disease. Constructing a neural network through protein interactions is a vital and useful approach to developing new drug targets. Some computational tools are being used to predict viral-host interactions. The database of human HIV-1 Vpr mutant gene expression microarray time series values contains records of experimentally validated interactions. The main problem in analyzing this type of microarray data is the classification problem, because the human HIV-1 Vpr mutant cell is an infected dendritic cell. We first cluster the gene microarray time series data using the subtractive clustering method and then construct a Radial Basis neural network on the clusters of HIV-1 Vpr mutant microarray time series data. The network output is optimized using a Genetic Algorithm, and from the optimized network output we obtain a significant gene, which may lead to drug discovery in the future.

Reactive power dispatch (RPD) is one of the major issues of modern energy management systems. This article presents an efficient genetic algorithm (GA) approach for modelling and solving the RPD problem of a power system in the framework of fuzzy goal programming (FGP) in an uncertain environment. In the proposed approach, the objectives of the RPD problem are fuzzily described. In the solution process, the proposed GA method is used in the framework of the FGP model in an iterative manner to reach a satisfactory decision. The proposed approach is tested on the standard IEEE 6-Generator 30-Bus System and compared with the solutions obtained in a previous study. (C) 2013 The Authors. Published by Elsevier Ltd.

Cancer is an extremely complex, heterogeneous and mutated genetic disease. Many researchers in molecular genetics have predicted a number of key genes which probably contribute to oncogenesis and are potential drug targets for different types of cancer, but this is still an ongoing process. In this article, we consider not only gene relevance but also the redundancy among genes. For identifying the non-redundant gene markers from microarray gene expression data, a graph-theoretic approach has been presented. The sample-versus-gene data presented by the microarray is first converted into a weighted undirected complete feature-graph, where the nodes represent the genes, with each gene's relevance as the node weight, and the edges are weighted according to the similarity value (correlation) among the genes. Then the densest subgraph having minimum average edge weight (similarity) and maximum average node weight (relevance) is identified from the original feature-graph. To find the densest subgraph, binary particle swarm optimization has been applied for minimizing the average edge weight and maximizing the average node weight through a single objective function. Thus an optimized reduced subgraph is found which contains a set of selected genes for which the average correlation is very low and the average gene relevance is very high. The proposed method is compared with SFS, T-test, Ranksum test, mRMR scheme, CFS, SBE and FCBF in terms of sensitivity, specificity, accuracy, F-score, Area Under ROC Curve (AUC), average correlation and stability on several real-life data sets.
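To make the single objective in this abstract concrete, the sketch below scores a candidate gene subset encoded as a binary mask, as a binary particle swarm optimizer would need. The abstract does not give the exact form of the objective, so the combination used here (average node weight minus average edge weight) is an assumption, as is the name `subgraph_fitness`.

```python
from itertools import combinations


def subgraph_fitness(mask, relevance, similarity):
    """Score a gene subset: high average relevance, low average pairwise similarity.

    mask       -- list of 0/1 flags, one per gene (the binary particle position)
    relevance  -- node weights, one per gene
    similarity -- symmetric matrix of edge weights (e.g. absolute correlation)
    The difference of the two averages is an assumed form of the objective.
    """
    chosen = [i for i, bit in enumerate(mask) if bit]
    if len(chosen) < 2:
        # No edges to average over; treat as an invalid subgraph.
        return float("-inf")
    avg_node = sum(relevance[i] for i in chosen) / len(chosen)
    pairs = list(combinations(chosen, 2))
    avg_edge = sum(similarity[i][j] for i, j in pairs) / len(pairs)
    return avg_node - avg_edge
```

Under this scoring, two highly relevant but strongly correlated genes can rank below a pair with lower average relevance and almost no correlation, which reflects the stated preference for non-redundant markers.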

Identifying relevant genes which are responsible for various types of cancer is an important problem. In this context, important genes refer to the marker genes which change their expression level in correlation with the risk or progression of a disease, or with the susceptibility of the disease to a given treatment. Gene expression profiling by microarray technology has been successfully applied to classification and diagnostic prediction of cancers. However, extracting these marker genes from the huge set of genes contained in a microarray data set is a major problem. Most of the existing methods for identifying marker genes find a set of genes which may be redundant in nature. Motivated by this, a multiobjective optimization method has been proposed which can find a small set of non-redundant disease-related genes providing high sensitivity and specificity simultaneously. In this article, the optimization problem has been modeled as a multiobjective one based on the framework of variable-length particle swarm optimization. Using some real-life data sets, the performance of the proposed algorithm has been compared with that of other state-of-the-art techniques.

These authors contributed equally to this work.

Predicting Protein Subcellular Localization Using Intelligent Systems. Rajesh Nair and Burkhard Rost, Columbia University. Contents: 10.1 Introduction; 10.1.1 Decoding Protein Function: A Major Challenge for Modern Biology; 10.1.1.1 Protein Function Has ...

Objective. To identify differentially expressed genes in synovial fibroblasts and examine the effect on gene expression of exposure to TNF-α and IL-1β. Methods. Restriction fragment differential display was used to isolate genes using degenerate primers complementary to the lysophosphatidic acid acyl transferase gene family. Differential gene expression was confirmed by reverse transcription-polymerase chain reaction and immunohistochemistry using a variety of synovial fibroblasts, including cells from patients with osteoarthritis and self-limiting parvovirus arthritis. Results. Irrespective of disease process, synovial fibroblasts constitutively produced higher levels of IL-6 and monocyte chemoattractant protein 1 (MCP-1) (CCL2) than skin fibroblasts. Seven genes were differentially expressed in synovial fibroblasts compared with skin fibroblasts. Of these genes, four [tissue factor pathway inhibitor 2 (TFPI2), growth regulatory oncogene β (GROβ), manganese superoxide dismutase (MnSOD) and granulocyte chemotactic protein 2 (GCP-2)] were all found to be constitutively overexpressed in synoviocytes derived from patients with osteoarthritis. These four genes were only weakly expressed in other synovial fibroblasts (rheumatoid and self-limiting parvovirus infection). However, expression in all types of fibroblasts was increased after stimulation with TNF-α and IL-1β. Three other genes (aggrecan, biglycan and caldesmon) were expressed at higher levels in all types of synovial fibroblasts compared with skin fibroblasts even after stimulation with TNF-α and IL-1. Conclusions. Seven genes have been identified with differential expression patterns in terms of disease process (osteoarthritis vs rheumatoid arthritis), state of activation (resting vs cytokine activation) and anatomical location (synovium vs skin). Four of these genes, TFPI2, GROβ (CXCL2), MnSOD and GCP-2 (CXCL6), were selectively overexpressed in osteoarthritis fibroblasts rather than rheumatoid fibroblasts. While these differences may represent differential behaviour of synovial fibroblasts in in vitro culture, these observations suggest that TFPI2, GROβ (CXCL2), MnSOD and GCP-2 (CXCL6) may represent new targets for treatments specifically tailored to osteoarthritis.
