Prototype based fuzzy classification in clinical proteomics

Prototype-based fuzzy classification with local relevance for proteomics

Neurocomputing, 2006

In this paper, we extend soft nearest prototype classification by local metric learning and fuzzy classification. The metric is thereby determined according to the given classification task, either separately for each prototype or per class. We apply the method to cancer detection based on proteomic data.

Classification of mass-spectrometric data in clinical proteomics using learning vector quantization methods

2008

In the present contribution we propose two recently developed classification algorithms for the analysis of mass-spectrometric data: the supervised neural gas and the fuzzy-labeled self-organizing map. The algorithms are inherently regularizing, which is recommended for these spectral data because of their high dimensionality and their sparseness for specific problems. Both algorithms are prototype-based, so that the principle of characteristic representatives is realized. This leads to an easy interpretation of the generated classification model. Further, the fuzzy-labeled self-organizing map is able to process uncertainty in data, and classification results can be obtained as fuzzy decisions. Moreover, this fuzzy classification together with the property of topographic mapping offers the possibility of class similarity detection, which can be used for class visualization. We demonstrate the power of both methods on two examples: the classification of bacteria (listeria types) and of neoplastic and non-neoplastic cell populations in breast cancer tissue sections.

A Comparison of Methods for Classifying Clinical Samples Based on Proteomics Data: A Case Study for Statistical and Machine Learning Approaches

PLoS ONE, 2011

The discovery of protein variation is an important strategy in disease diagnosis within the biological sciences. The current benchmark for elucidating information from multiple biological variables is the so-called "omics" disciplines of the biological sciences. Such variability is uncovered by implementation of multivariable data mining techniques, which fall into two primary categories: machine learning strategies and statistically based approaches. Typically, proteomic studies can produce hundreds or thousands of variables, p, per observation, n, depending on the analytical platform or method employed to generate the data. Many classification methods are limited by an n ≪ p constraint and, as such, require pretreatment to reduce the dimensionality prior to classification. Recently, machine learning techniques have gained popularity in the field for their ability to successfully classify unknown samples. One limitation of such methods is the lack of a functional model allowing meaningful interpretation of results in terms of the features used for classification. This is a problem that might be solved using a statistical model-based approach, where the importance of each individual protein is explicit and the proteins are combined into a readily interpretable classification rule without relying on a black box approach. Here we incorporate the statistical dimension reduction techniques Partial Least Squares (PLS) and Principal Components Analysis (PCA), followed by both statistical and machine learning classification methods, and compare them to a popular machine learning technique, Support Vector Machines (SVM). Both PLS and SVM demonstrate strong utility for proteomic classification problems.
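As a minimal sketch of the dimension-reduction step described in this abstract, the snippet below projects an n ≪ p data matrix onto its first principal components using a singular value decomposition. The function name `pca_scores`, the matrix sizes, and the random data are assumptions made for illustration only; they are not taken from the paper, and the subsequent classifier is left out.

```python
import numpy as np

def pca_scores(X, n_components):
    """Project the rows of X onto the first n_components principal axes."""
    Xc = X - X.mean(axis=0)                      # center each variable
    # Thin SVD: the rows of Vt are the principal axes, ordered by variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

# Hypothetical data: 20 observations (n) with 500 variables each (p).
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 500))

scores = pca_scores(X, n_components=5)           # reduced 20 x 5 matrix
```

The reduced `scores` matrix has fewer columns than rows, so it could then be passed to a classifier that cannot handle n ≪ p directly.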

Mass Spectrometry-Based Proteomic Data for Cancer Diagnosis using Interval Type-2 Fuzzy System

2015 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE), 2015

An interval type-2 fuzzy logic system is introduced for cancer diagnosis using mass spectrometry-based proteomic data. The fuzzy system is combined with a feature extraction procedure that uses the wavelet transform and the Wilcoxon ranking test. The proposed feature extraction generates feature sets that serve as inputs to the type-2 fuzzy classifier. Uncertainty, noise and outliers, which are common in proteomic data, motivate the use of a type-2 fuzzy system. Tabu search is applied for structure learning of the fuzzy classifier. Experiments are performed using two benchmark proteomic datasets for the prediction of ovarian and pancreatic cancer. The dominance of the suggested feature extraction and of the type-2 fuzzy classifier over competing methods is showcased through experimental results. The proposed approach is therefore helpful to clinicians and practitioners, as it can be implemented as a medical decision support system in practice.

Analysis of spectral data in clinical proteomics by use of learning vector quantizers

Studies in Computational Intelligence, 2008

Clinical proteomics based on mass spectrometry has gained tremendous visibility in the scientific and clinical community. Machine learning methods are key for efficient processing of the complex data. One major class is prototype-based algorithms. Prototype-based vector quantizers or classifiers are intuitive approaches realizing the principle of characteristic representatives for data subsets or decision regions between them. In this contribution we concentrate on recent extensions of specific prototype-based methods as universal tools in the light of clinical proteomics. We focus on non-standard metrics and the discovery of biomarker patterns. In particular, we demonstrate applications of the weighted Euclidean metric and the weighted functional norm (based on the weighted Lp-norm) or kernelized metrics, taking the specific nature of mass spectra into account. This allows an efficient feature selection, which may be used for biomarker identification. The adaptation of the algorithms to these specific requirements leads to effective tools for knowledge discovery while keeping the robustness of the original simple approaches. Fuzzy classification and regression in clinical proteomics by use of such models is considered. The usefulness of the above extensions is shown in the analysis of clinical data obtained from mass spectra.
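To illustrate how a weighted Euclidean metric can change a nearest-prototype decision, here is a minimal sketch. The prototypes, labels, sample, and relevance weights are hypothetical toy values, and the weights are fixed by hand rather than learned; this is not the training procedure from the paper.

```python
import numpy as np

def weighted_sq_dist(x, prototypes, relevance):
    """Squared weighted Euclidean distance from x to each prototype row."""
    diff = prototypes - x                 # shape (n_prototypes, n_features)
    return (relevance * diff ** 2).sum(axis=1)

def classify(x, prototypes, labels, relevance):
    """Return the label of the closest prototype under the weighted metric."""
    return labels[int(np.argmin(weighted_sq_dist(x, prototypes, relevance)))]

# Hypothetical toy data: two prototypes over three spectral bins.
prototypes = np.array([[1.0, 0.0, 5.0],
                       [0.0, 1.0, 0.0]])
labels = ["control", "cancer"]
x = np.array([0.2, 0.9, 4.0])

uniform = np.ones(3) / 3                  # plain Euclidean up to scaling
# Assumed relevance profile: the third bin carries no class information.
relevance = np.array([0.5, 0.5, 0.0])

label_uniform = classify(x, prototypes, labels, uniform)
label_weighted = classify(x, prototypes, labels, relevance)
```

With uniform weights the large third bin dominates the distance; with the assumed relevance profile that bin is ignored and the decision changes. A relevance weight near zero is also what would mark a feature as a poor biomarker candidate.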

Comparison of Supervised Classification Methods for Protein Profiling in Cancer Diagnosis

Cancer Informatics, 2007

A key challenge in clinical proteomics of cancer is the identification of biomarkers that could allow detection, diagnosis and prognosis of the diseases. Recent advances in mass spectrometry and proteomic instrumentation offer a unique chance to rapidly identify these markers. These advances pose considerable challenges, similar to those created by microarray-based investigation, for the discovery of patterns of markers from high-dimensional data, specific to each pathologic state (e.g. normal vs cancer). We propose a three-step strategy to select important markers from high-dimensional mass spectrometry data using surface enhanced laser desorption/ionization (SELDI) technology. The first two steps are the selection of the most discriminating biomarkers with a construction of different classifiers. Finally, we compare and validate their performance and robustness using different supervised classification methods such as Support Vector Machine, Linear Discriminant Analysis, Quadratic Discriminant Analysis, Neural Networks, Classification Trees and Boosting Trees. We show that the proposed method is suitable for analysing high-throughput proteomics data and that the combination of logistic regression and Linear Discriminant Analysis outperforms the other methods tested.

Data mining techniques for cancer detection using serum proteomic profiling

Artificial Intelligence in Medicine, 2004

Pathological changes in an organ or tissue may be reflected in proteomic patterns in serum. It is possible that unique serum proteomic patterns could be used to discriminate cancer samples from non-cancer ones. Due to the complexity of proteomic profiling, a higher order analysis such as data mining is needed to uncover the differences in complex proteomic patterns. The objectives of this paper are (1) to briefly review the application of data mining techniques in proteomics for cancer detection/diagnosis; (2) to explore a novel analytic method with different feature selection methods; (3) to compare the results obtained on different datasets and those reported by Petricoin et al. in terms of detection performance and selected proteomic patterns. Methods and material: Three serum SELDI MS data sets were used in this research to identify serum proteomic patterns that distinguish the serum of ovarian cancer cases from non-cancer controls. A support vector machine-based method is applied in this study, in which statistical testing and genetic algorithm-based methods are used for feature selection respectively. Leave-one-out cross validation with receiver operating characteristic (ROC) curve is used for evaluation and comparison of cancer detection performance. Results and conclusions: The results showed that (1) data mining techniques can be successfully applied to ovarian cancer detection with a reasonably high performance; (2) the classification using features selected by the genetic algorithm consistently outperformed those selected by statistical testing in terms of accuracy and robustness; (3) the discriminatory features (proteomic patterns) can be very different from one selection method to another. In other words, the pattern selection and its classification efficiency are highly classifier dependent. Therefore, when using data mining techniques, the discrimination of cancer from normal does not depend solely upon the identity and origination of cancer-related proteins.

An intensity-region driven multi-classifier scheme for improving the classification accuracy of proteomic MS-spectra

Computer Methods and Programs in Biomedicine, 2010

In this study, a pattern recognition system is presented for improving the classification accuracy of MS-spectra by gathering information from different MS-spectra intensity regions using a majority vote ensemble combination. The method starts by automatically breaking down all MS-spectra into common intensity regions. Subsequently, the most informative features (m/z values), which might constitute potentially significant biomarkers, are extracted from each common intensity region over all the MS-spectra. Finally, normal MS-spectra are discriminated from ovarian cancer MS-spectra using a multi-classifier scheme whose members are the Support Vector Machine, the Probabilistic Neural Network and the k-Nearest Neighbour classifiers. Clinical material was obtained from the publicly available ovarian proteomic dataset (8-7-02). To ensure robust and reliable estimates, the proposed pattern recognition system was evaluated using an external cross-validation process. The average overall performance of the system in discriminating normal from cancer ovarian MS-spectra was 97.18%, with 98.52% mean sensitivity and 94.84% mean specificity.
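The majority vote combination and the sensitivity and specificity measures mentioned in this abstract can be sketched as follows. The per-classifier votes and true labels are made-up toy values, and the helper names `majority_vote` and `sensitivity_specificity` are assumptions for illustration; the three member classifiers themselves are not implemented here.

```python
from collections import Counter

def majority_vote(predictions):
    """Combine the labels of several classifiers for one spectrum."""
    label, _ = Counter(predictions).most_common(1)[0]
    return label

def sensitivity_specificity(true_labels, predicted, positive="cancer"):
    """Return (sensitivity, specificity) with `positive` as the disease class."""
    pairs = list(zip(true_labels, predicted))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical votes of three classifiers on three spectra.
votes_per_spectrum = [
    ["cancer", "cancer", "normal"],
    ["normal", "normal", "normal"],
    ["normal", "cancer", "normal"],
]
true_labels = ["cancer", "normal", "cancer"]

combined = [majority_vote(v) for v in votes_per_spectrum]
sens, spec = sensitivity_specificity(true_labels, combined)
```

With three voters and two classes a tie cannot occur, which is one reason an odd number of ensemble members is convenient for a binary problem.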

Misclassification Errors in Unsupervised Classification Methods. Comparison Based on the Simulation of Targeted Proteomics Data

Journal of Proteomics & Bioinformatics, 2016

Unsupervised classification methods are gaining acceptance in omics studies of complex common diseases, which are often vaguely defined and are likely collections of disease subtypes. Unsupervised classification based on the molecular signatures identified in omics studies has the potential to reflect molecular mechanisms of the subtypes of the disease and to lead to more targeted and successful interventions for the identified subtypes. Multiple classification algorithms exist but none is ideal for all types of data. Importantly, there are no established methods to estimate sample size in unsupervised classification (unlike power analysis in hypothesis testing). Therefore, we developed a simulation approach allowing comparison of misclassification errors and estimation of the required sample size for a given effect size, number, and correlation matrix of the differentially abundant proteins in targeted proteomics studies. All the experiments were performed in silico. The simulated data imitated the data expected from a study of the plasma of patients with lower urinary tract dysfunction using the aptamer proteomics assay Somascan (SomaLogic Inc, Boulder, CO), which targeted 1129 proteins, including 330 involved in inflammation, 180 in stress response, 80 in aging, etc. Three popular clustering methods (hierarchical, k-means, and k-medoids) were compared. K-means clustering performed much better for the simulated data than the other two methods and enabled classification with misclassification error below 5% in the simulated cohort of 100 patients based on the molecular

Cancer proteomics: From identification of novel markers to creation of artificial learning models for tumor classification

Electrophoresis, 2000

Studies of global protein expression in human tumors have led to the identification of various polypeptide markers, potentially useful as diagnostic tools. Many changes in gene expression recorded between benign and malignant human tumors are due to post-translational modifications, not detected by analyses of RNA. Proteome analyses have also yielded information about tumor heterogeneity and the degree of relatedness between primary tumors and their metastases. Results from our own studies have shown a similar pattern of changes in protein expression in different epithelial tumors, such as decreases in tropomyosin and cytokeratin expression and increases in proliferating cell nuclear antigen (PCNA) and heat shock protein expression. Such information has been used to create artificial learning models for tumor classification. The artificial learning approach has potential to improve tumor diagnosis and cancer treatment prediction.