Susmita Datta | University of Louisville, KY
Papers by Susmita Datta
Journal of Proteomics & Bioinformatics, 2009
Mass spectrometry has emerged as a core technology for high-throughput proteomic profiling. It has enormous potential in biomedical research. However, the complexity of the data poses new statistical challenges for the analysis. Statistical methods and software development for analyzing proteomic data are likely to remain a major area of research in the coming years. In this paper, a novel statistical method for analyzing high-dimensional MALDI-TOF mass spectrometry data in proteomic research is proposed. Chemical knowledge regarding the isotopic distribution of peptide molecules, together with quantitative modeling, is used to detect chemically valuable peaks in each spectrum. More specifically, a mixture of location-shifted Poisson distributions is fitted to the deamidated isotopic distribution of a peptide molecule. Maximum likelihood estimation by the expectation-maximization (EM) technique is used to estimate the parameters of the distribution. A formal statistical test is then constructed to determine whether a cluster of consecutive features (intensity values) in a mass spectrum corresponds to a true isotopic pattern. Thus, the monoisotopic peaks in an individual spectrum are identified. Performance of our method is examined through extensive simulations. We also provide a numerical illustration of our method with a real dataset and compare it with an existing method of peak detection. External biochemical validation of our detected peaks is provided.
Statistical Science and Interdisciplinary Research, 2009
PLoS ONE, 2015
The importance of lipids for cell function and health has been widely recognized; e.g., a disorder in the lipid composition of cells has been related to atherosclerosis-caused cardiovascular disease (CVD). Lipidomics analyses are characterized by a large, though not huge, number of mutually correlated variables, and the associations of these variables with outcomes are potentially complex. Differential network analysis provides a formal statistical method capable of inferential analysis to examine differences in the network structures of the lipids under two biological conditions. It also helps identify potential relationships requiring further biological investigation. We provide a recipe for conducting a permutation test on association scores resulting from partial least squares regression with multiply imputed lipidomic data from the LUdwigshafen RIsk and Cardiovascular Health (LURIC) study, paying particular attention to the left-censored missing values typical for a wide range of data...
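As a rough illustration of the permutation-test idea mentioned in the abstract above, here is a minimal sketch. It is not the authors' procedure: the paper's association scores come from partial least squares regression on multiply imputed data, whereas this sketch swaps in a plain difference in group means as the test statistic. The function names and parameters (`permutation_test`, `n_perm`, `seed`) are hypothetical.

```python
import random


def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)


def permutation_test(group_a, group_b, n_perm=2000, seed=0):
    """Two-sided permutation p-value for a difference in group means.

    Assumption: the statistic is |mean(a) - mean(b)|, a stand-in for the
    PLS-based association scores described in the abstract.
    """
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_perm):
        # Randomly reassign the condition labels under the null hypothesis.
        rng.shuffle(pooled)
        stat = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if stat >= observed:
            hits += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return (hits + 1) / (n_perm + 1)


if __name__ == "__main__":
    condition_1 = [0.0] * 6
    condition_2 = [10.0] * 6
    print(permutation_test(condition_1, condition_2))
```

The add-one correction counts the observed labeling as one of the permutations, which is a common convention for Monte Carlo permutation p-values.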
Bioinformation, 2011
We start by constructing gene-gene association networks based on about 300 genes whose expression values vary between the groups of CFS patients (plus control). Connected components (modules) from these networks are further inspected for their predictive ability for symptom severity, genotypes of two single nucleotide polymorphisms (SNPs) known to be associated with symptom severity, and intensity of the ten most discriminative protein features. We use two different network construction methods and choose the common genes identified in both for added validation. Our analysis identified eleven genes which may play important roles in certain aspects of CFS or related symptoms. In particular, the gene WASF3 (aka WAVE3) possibly regulates brain cytokines involved in the mechanism of fatigue through the p38 MAPK regulatory pathway.
Bioinformation, 2011
In recent years, mass spectrometry has become one of the core technologies for high-throughput proteomic profiling in biomedical research. However, the reproducibility of results obtained with this technology has been questioned. It has been realized that sophisticated automatic signal processing algorithms using advanced statistical procedures are needed to analyze high-resolution, high-dimensional proteomic data, e.g., Matrix-Assisted Laser Desorption/Ionization Time-of-Flight (MALDI-TOF) data. In this paper we present pkDACLASS, an R-based software package that provides a complete data analysis solution for users of MALDI-TOF raw data. Complete data analysis comprises data preprocessing, monoisotopic peak detection through statistical model fitting and testing, alignment of the monoisotopic peaks for multiple samples, and classification of the normal and diseased samples through the detected peaks. The software provides flexibility to the users to accomplish the complete and integrated an...
BMC Bioinformatics, 2006
A cluster analysis is the most commonly performed procedure (often regarded as a first step) on a set of gene expression profiles. In most cases, a post hoc analysis is done to see if the genes in the same clusters can be functionally correlated. While successes of such analyses have often been reported in microarray studies (most of which used standard hierarchical clustering, UPGMA, with one minus Pearson's correlation coefficient as the measure of dissimilarity), such groupings can often be misleading. More importantly, a systematic evaluation of the entire set of clusters produced by such unsupervised procedures is necessary, since they also contain genes that are seemingly unrelated or may have more than one common function. Here we quantify the performance of a given unsupervised clustering algorithm applied to a given microarray study in terms of its ability to produce biologically meaningful clusters using a reference set of functional class...
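The dissimilarity mentioned in the abstract above, one minus Pearson's correlation coefficient, can be sketched as follows. This is only an illustration of that measure, not the paper's evaluation method; the function names and the toy expression profiles are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def dissimilarity(x, y):
    """One minus Pearson correlation: 0 for perfectly co-expressed
    profiles, 2 for perfectly anti-correlated ones."""
    return 1.0 - pearson(x, y)


def dissimilarity_matrix(profiles):
    """Pairwise dissimilarities, the usual input to hierarchical clustering."""
    return [[dissimilarity(p, q) for q in profiles] for p in profiles]


if __name__ == "__main__":
    # Hypothetical expression profiles for three genes over four time points.
    genes = [
        [1.0, 2.0, 3.0, 4.0],
        [2.0, 4.0, 6.0, 8.0],
        [4.0, 3.0, 2.0, 1.0],
    ]
    for row in dissimilarity_matrix(genes):
        print([round(d, 3) for d in row])
```

Because the measure depends only on the shape of a profile and not on its scale, genes with proportional expression patterns are treated as identical.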
Genetics, 1996
In this paper we use cytonuclear disequilibria to test the neutrality of mtDNA markers. The data considered here involve sample frequencies of cytonuclear genotypes subject to both statistical sampling variation and genetic sampling variation. First, we obtain the dynamics of the sample cytonuclear disequilibria assuming random drift alone as the source of genetic sampling variation. Next, we develop a test statistic using cytonuclear disequilibria via the theory of generalized least squares to test the random drift model. The null distribution of the test statistic is shown to be approximately chi-squared using an asymptotic argument as well as computer simulation. Power of the test statistic is investigated under an alternative model with drift and selection. The method is illustrated using data from cage experiments utilizing different cytonuclear genotypes of Drosophila melanogaster. A program for implementing the neutrality test is available upon request.
Journal of acquired immune deficiency syndromes and human retrovirology : official publication of the International Retrovirology Association, Jan 15, 1996
Current Phase III trials are designed to assess only a vaccine candidate's ability to reduce susceptibility to infection or disease, that is, vaccine efficacy for susceptibility (VES). Human immunodeficiency virus (HIV) vaccination, however, may reduce the level of infectiousness of vaccinees who become infected, producing an important indirect reduction in HIV transmission even if the vaccine confers only modest protection against infection. We propose two approaches for augmenting the information of a classic trial for estimating protective efficacy that enable the additional estimation of the vaccine's effect on infectiousness, that is, vaccine efficacy for infectiousness (VEI). In the first augmentation, steady sexual partners of trial participants are recruited but not randomized to vaccine or placebo. Their infection status is monitored throughout the trial. In the second augmentation, the sexual partners are randomized. Through computer simulations and analytic method...
2006 International Conference of the IEEE Engineering in Medicine and Biology Society, 2006
Cluster analysis has become a standard part of gene expression analysis. In this paper, we propose a novel semi-supervised approach that offers the same flexibility as hierarchical clustering, yet utilizes, along with the experimental gene expression data, common biological information about different genes that is being compiled at various public, Web-accessible databases. We argue that such an approach is inherently superior to the standard unsupervised approach of grouping genes based on expression data alone. It is shown that our biologically supervised methods produce better clustering results than the corresponding unsupervised methods, as judged by the distance from the model temporal profiles. R code for the clustering algorithm is available from the authors upon request.
Statistical Analysis of Next Generation Sequencing Data, 2014
Statistical Methodology, 2006
This is a comparative study of various clustering and classification algorithms as applied to differentiate cancer and non-cancer protein samples using mass spectrometry data. Our study demonstrates the usefulness of a feature selection step prior to applying a machine learning tool. A natural and common choice of a feature selection tool is the collection of marginal p-values obtained from t-tests for testing the intensity differences at each m/z ratio in the cancer versus non-cancer samples. We study the effect of selecting a cutoff in terms of the overall Type I error rate control on the performance of the clustering and classification algorithms using the significant features. For the classification problem, we also considered m/z selection using the importance measures computed by the Random Forest algorithm of Breiman. Using a data set of proteomic analysis of serum from ovarian cancer patients and serum from cancer-free individuals in the Food and Drug Administration and National Cancer Institute Clinical Proteomics Database, we undertake a comparative study of the net effect of the machine learning algorithm-feature selection tool-cutoff criteria combination on the performance as measured by an appropriate error rate measure.
Proceedings of The National Academy of Sciences, 2000
In recent microarray experiments thousands of gene expressions are simultaneously tested in comparing samples (e.g., tissue types or experimental conditions). Application of a statistical test, such as the t-test, would lead to a p-value for each gene that reflects the amount of statistical evidence present in the data that the given gene is indeed differentially expressed. We show how to
Computational Statistics, 2014